Abstract

The problem of inferring a clustering of a data set has been the sub- 
ject of much research in Bayesian analysis, and there currently exists a 
solid mathematical foundation for Bayesian approaches to clustering. In 
particular, the class of probability distributions over partitions of a data 
set has been characterized in a number of ways, including via exchange- 
able partition probability functions (EPPFs) and the Kingman paintbox. 
Here, we develop a generalization of the clustering problem, called fea- 
ture allocation, where we allow each data point to belong to an arbi- 
trary, non-negative integer number of groups, now called features or top- 
ics. We define and study an "exchangeable feature probability function" 
(EFPF) — analogous to the EPPF in the clustering setting — for certain 
types of feature models. Moreover, we introduce a "feature paintbox" 
characterization — analogous to the Kingman paintbox for clustering — of 
the class of exchangeable feature models. We provide a further character- 
ization of the subclass of feature allocations that have EFPF representa- 
tions. 

1 Introduction 

Exchangeability has played a key role in the development of Bayesian analysis in 
general and Bayesian nonparametric analysis in particular. Exchangeability can 
be viewed as asserting that the indices used to label the data points are irrelevant 
for inference, and as such is often a natural modeling assumption. Under such an
assumption, one is licensed by de Finetti's theorem (De Finetti, 1931) to propose
the existence of an underlying parameter that renders the data conditionally 
independent and identically distributed (iid) and to place a prior distribution on 
that parameter. Moreover, the theory of infinitely exchangeable sequences has 
advantages of simplicity over the theory of finite exchangeability, encouraging 
modelers to take a nonparametric stance in which the underlying "parameter" 
is infinite dimensional. Finally, the development of algorithms for posterior 
inference is often greatly simplified by the assumption of exchangeability, most 
notably in the case of Bayesian nonparametrics, where models based on the 
Dirichlet process and other combinatorial priors became useful tools in practice
only when it was realized how to exploit exchangeability to develop inference
procedures (Escobar, 1994).

The connection of exchangeability to Bayesian nonparametric modeling is 
well established in the case of models for clustering. The goal of a clustering 
procedure is to infer a partition of the data points. In the Bayesian setting, one 
works with random partitions, and, under an exchangeability assumption, the 
distribution on partitions should be invariant to a relabeling of the data points. 
The notion of an exchangeable random partition has been formalized by Kingman,
Aldous, and others (Kingman, 1978; Aldous, 1985), and has led to the definition
of an exchangeable partition probability function (EPPF) (Pitman, 1995).
The EPPF is a mathematical function of the cardinalities of the groups in a par- 
tition. Exchangeability of the random partition is captured by the requirement 
that the EPPF be a symmetric function of these cardinalities. Furthermore, the 
exchangeability of a partition can be related to the exchangeability of a sequence 
of random variables representing the assignments of data points to clusters, for 
which a de Finetti mixing measure necessarily exists. This de Finetti measure
is known as the Kingman paintbox (Kingman, 1978). The relationships among
this circle of ideas are well understood: it is known that there is an equivalence
among the class of exchangeable random partitions, the class of random
partitions that possess an EPPF, and the class of random partitions generated by
a Kingman paintbox; see Pitman (2006) for an overview of these relations. A
specific example of these relationships is given by the Chinese restaurant pro- 
cess and the Dirichlet process, but several other examples are known and have 
proven useful in Bayesian nonparametrics.

Our focus in the current paper is on an alternative to clustering models that 
we refer to as feature allocation models. While in a clustering model each data 
point is assigned to one and only one class, in a feature allocation model each 
data point can belong to multiple groups. It is often natural to view the groups 
as corresponding to traits or features, such that the notion that a data point 
belongs to multiple groups corresponds to the point exhibiting multiple traits 
or features. A Bayesian feature allocation model treats the feature assignments 
for a given data point as random and subject to posterior inference. A nonpara- 
metric Bayesian feature allocation model takes the number of features to also 
be random and subject to inference. 

Research on nonparametric Bayesian feature allocation has been based around
a single prior distribution, the Indian buffet process of Griffiths and Ghahramani
(2006), which is known to have the beta process as its underlying de Finetti
measure (Thibaux and Jordan, 2007). There does not yet exist a general definition
of exchangeability for feature allocation models, nor counterparts of the EPPF
or the Kingman paintbox.

In this paper we supply these missing constructions. We provide a rigorous
treatment of exchangeable feature allocations (in Section 2 and Section 3). In
Section 4 we define a notion of exchangeable feature probability function (EFPF)
that is the analogue for feature allocations of the EPPF for clustering. We then
proceed to define a feature paintbox in Section 5. Finally, in Section 6 we
discuss a class of models that we refer to as feature frequency models for which
the construction of the feature paintbox is particularly straightforward, and we
discuss the important role that feature frequency models play in the general
theory of feature allocations.

[Figure 1: Venn diagram. Region labels: Exchangeable FAs; Regular FAs = Feature
paintbox models; FAs with EFPFs = Frequency models plus singletons; Exchangeable
RPs = RPs with EPPFs = Kingman paintbox models. Marked models: CRP, IBP,
two-feature example.]

Figure 1: A summary of the relations described in this paper. Rounded rectangles
represent classes with the following abbreviations: RP for random partition,
FA for random feature allocation, EPPF for exchangeable partition probability
function, EFPF for exchangeable feature probability function. The large
black dots represent particular models with the following abbreviations: CRP
for Chinese restaurant process, IBP for Indian buffet process. The two-feature
example refers to Example 9 with the choice $p_{11} p_{00} \neq p_{10} p_{01}$.

The Venn diagram shown in Figure 1 is a useful guide for understanding our
results, and the reader may wish to consult this diagram in working through 
the paper. As shown in the diagram, random partitions (RPs) are a special case 
of random feature allocations (FAs), and previous work on random partitions 
can be placed within our framework. Thus, in the diagram, we have depicted 
the equivalence already noted of exchangeable RPs, RPs that possess an EPPF, 
and Kingman paintboxes. We also see that random feature allocations have 
a somewhat richer structure: the class of FAs with EFPFs is not the same 
as those having an underlying feature paintbox. But the class of EFPFs is 
characterized in a different way; we will see that the class of feature allocations 
with EFPFs is equivalent to the class of FAs obtained from feature frequency 
models together with singletons of a certain distribution. Indeed, we will find 
that the class of clusterings with EPPFs is, in this way, analogous to the class 
of feature allocations with EFPFs when both are considered as subclasses of the 
general class of feature allocations. The diagram also shows several examples 
that we use to illustrate and develop our theory. 



2 Feature allocations 

We consider data sets with N points and let the points be indexed by the
integers [N] := {1, 2, ..., N}. We also explicitly allow $N = \infty$, in which case
the index set is $\mathbb{N} = \{1, 2, 3, \ldots\}$. For our discussion of feature allocations and
partitioning it is sufficient to focus on the indices rather than the data points;
thus, we will be discussing models for collections of subsets of [N] and $\mathbb{N}$.



Our introduction to feature allocations follows Broderick et al. (2012b). We
define a feature allocation of [N] to be a multiset of non-empty subsets of
[N] called features, such that no index n belongs to infinitely many features. We
write $f_N = \{A_1, \ldots, A_K\}$, where K is the number of features. An example
feature allocation of [6] is $f_6 = \{\{2, 3\}, \{2, 4, 6\}, \{3\}, \{3\}, \{3\}\}$. Similarly, a feature
allocation of $\mathbb{N}$ is a multiset of non-empty subsets of $\mathbb{N}$ such that no index
n belongs to infinitely many features. The total number of features in this case
may be infinite, in which case we write $f_\infty = \{A_1, A_2, \ldots\}$. An example
feature allocation of $\mathbb{N}$ is
$f_\infty = \{\{n : n \text{ is prime}\}, \{n : n \text{ is not divisible by two}\}\}$.
Finally, we may have K = 0, and $f_\infty = \emptyset$ is a valid feature allocation.

A partition is a special case of a feature allocation for which the features are 
restricted to be mutually exclusive and exhaustive. The features of a partition 
are often referred to as blocks or clusters. We note that a partition is always a 
feature allocation, but the converse statement does not hold in general; neither 
of the examples given above ($f_6$ and $f_\infty$) is a partition.
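
As a minimal sketch of these definitions (not part of the original development), a finite feature allocation can be held in Python as a list of frozensets, with repeated entries modeling the multiset. The helper names `is_feature_allocation` and `is_partition` are hypothetical and introduced only for illustration.

```python
from collections import Counter

def is_feature_allocation(features, N):
    """Check that `features` is a multiset of non-empty subsets of [N] = {1, ..., N}."""
    return all(len(A) > 0 and all(1 <= n <= N for n in A) for A in features)

def is_partition(features, N):
    """A partition is a feature allocation whose features are mutually
    exclusive and exhaustive: every index lies in exactly one feature."""
    if not is_feature_allocation(features, N):
        return False
    counts = Counter(n for A in features for n in A)
    return all(counts[n] == 1 for n in range(1, N + 1))

# The example allocation of [6] from the text; the feature {3} appears three times.
f6 = [frozenset({2, 3}), frozenset({2, 4, 6}),
      frozenset({3}), frozenset({3}), frozenset({3})]

# A partition of [6], chosen arbitrarily for contrast.
blocks = [frozenset({1, 5}), frozenset({2, 3}), frozenset({4, 6})]
```

Here `f6` passes the feature-allocation check but fails the partition check, since index 1 belongs to no feature and index 3 belongs to several.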

We now turn to the problem of defining exchangeable feature allocations,
extending previous work on exchangeable random partitions (Aldous, 1985). Let
$\mathcal{F}_N$ be the space of all feature allocations of [N]. A random feature allocation
$F_N$ of [N] is a random element of $\mathcal{F}_N$. Let $\sigma : \mathbb{N} \to \mathbb{N}$ be a finite permutation.
That is, for some finite value $N_\sigma$, we have $\sigma(n) = n$ for all $n > N_\sigma$. Further, for
any feature $A \subset \mathbb{N}$, denote the permutation applied to the feature as follows:
$\sigma(A) := \{\sigma(n) : n \in A\}$. For any feature allocation $F_N$, denote the permutation
applied to the feature allocation as follows: $\sigma(F_N) := \{\sigma(A) : A \in F_N\}$. Finally,
let $F_N$ be a random feature allocation of [N]. Then we say that a random feature
allocation $F_N$ is exchangeable if $F_N \stackrel{d}{=} \sigma(F_N)$ for every permutation $\sigma$ of [N].

In addition to exchangeability, we also require our distributions on feature 
allocations to exhibit a notion of coherence across different ranges of the index. 
Intuitively, we often imagine the indices as denoting time, and it is natural to 
suppose that the randomness at time n is coherent with the randomness at time 
n + 1. More formally, we say that a feature allocation $f_M$ of [M] is a restriction
of a feature allocation $f_N$ of [N] for M < N if

$$f_M = \{A \cap [M] : A \in f_N,\ A \cap [M] \neq \emptyset\}.$$

Let $\mathcal{R}_N(f_M)$ be the set of all feature allocations of [N] whose restriction to [M]
is $f_M$.
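
The two operations used so far, applying a permutation to a feature allocation and restricting an allocation to a shorter index range, can be sketched as follows. This is an illustrative sketch only; the function names `restrict`, `permute`, and `same_allocation` are hypothetical, and permutations are assumed to be given as dictionaries that list only the indices they move.

```python
from collections import Counter

def restrict(features, M):
    """Restriction of a feature allocation to [M]: intersect each feature
    with {1, ..., M} and drop the intersections that are empty."""
    index_set = set(range(1, M + 1))
    restricted = [frozenset(A & index_set) for A in features]
    return [A for A in restricted if A]

def permute(features, sigma):
    """Apply a permutation (a dict n -> sigma(n); unlisted indices are fixed)
    to every feature of the allocation."""
    return [frozenset(sigma.get(n, n) for n in A) for A in features]

def same_allocation(f, g):
    """Feature allocations are multisets, so compare with multiplicity."""
    return Counter(f) == Counter(g)

f6 = [frozenset({2, 3}), frozenset({2, 4, 6}),
      frozenset({3}), frozenset({3}), frozenset({3})]
f3 = restrict(f6, 3)                   # {{2, 3}, {2}, {3}, {3}, {3}}
swapped = permute(f6, {2: 3, 3: 2})    # transposition of indices 2 and 3
```

Restricting in two stages agrees with restricting once, e.g. restricting `f6` to [4] and then to [2] gives the same multiset as restricting `f6` directly to [2], which is the deterministic counterpart of the coherence requirement.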

Let $\mathbb{P}$ denote a probability measure on some probability space supporting
$(F_n)$. We say that the sequence of random feature allocations $(F_n)$ is consistent
in distribution if for all M and N such that M < N, we have

$$\mathbb{P}(F_M = f_M) = \sum_{f_N \in \mathcal{R}_N(f_M)} \mathbb{P}(F_N = f_N).$$

We say that the sequence $(F_n)$ is strongly consistent if for all M and N such
that M < N, we have

$$F_N \stackrel{a.s.}{\in} \mathcal{R}_N(F_M).$$

Given any $(F_n)$ that is consistent in distribution, the Kolmogorov extension
theorem implies that we can construct a sequence of random feature allocations 
that is strongly consistent and has the same finite dimensional distributions. So 
henceforth we simply use the term "consistency" to refer to strong consistency. 

With this consistency condition, we can define a random feature allocation
$F_\infty$ of $\mathbb{N}$ as a consistent sequence of finite feature allocations. Thus $F_\infty$ may be
thought of as a random element of the space of such sequences: $F_\infty = (F_n)_{n=1}^{\infty}$.
We say that $F_N$ is a restriction of $F_\infty$ to [N] when it is the Nth element in this
sequence. We let $\mathcal{F}_\infty$ denote the space of consistent feature allocation sequences,
of which each random feature allocation is a random element. The sigma field
associated with this space is generated by the finite-dimensional sigma fields of
the restricted random feature allocations $F_n$.

We say that $F_\infty$ is exchangeable if $F_\infty \stackrel{d}{=} \sigma(F_\infty)$ for every finite permutation
$\sigma$. That is, for every permutation $\sigma$ that changes no indices above N for some
$N < \infty$, we require $F_N \stackrel{d}{=} \sigma(F_N)$, where $F_N$ is the restriction of $F_\infty$ to [N].



3 Labeling features 

Now that we have defined consistent, exchangeable random feature allocations, 
we want to characterize the class of all distributions on these allocations. We 
begin by considering some alternative representations of the feature allocation 
that are not merely useful, but indeed key to some of our later results. 

A number of authors have made use of matrices as a way of representing
feature allocations (Griffiths and Ghahramani, 2006; Thibaux and Jordan,
2007; Doshi et al., 2009). This representation, while a boon for intuition in some
regards, requires care because a matrix presupposes an order on the features,
which is not a part of the feature allocation a priori. We cover this distinction
in some detail next.

We start by defining an a priori labeled feature allocation. Let $F_{N,1}$ be
the collection of indices in [N] with feature 1, let $F_{N,2}$ be the collection of
indices in [N] with feature 2, etc. Here, we think of a priori labels as being
the ordered, positive natural numbers. This specification is different from (a 
priori unlabeled) feature allocations as defined above since there is nothing 
to distinguish the features in a feature allocation other than, potentially, the 
members of a feature. Consider the following analogy: an a priori labeled 
feature allocation is to a feature allocation as a classification is to a clustering. 
Indeed, when each index n belongs to exactly one feature in an a priori labeled
feature allocation, feature 1 is just class 1, feature 2 is class 2, and so on.

Another way to think of an a priori labeled feature allocation of [N] is as 
a matrix of N rows filled with zeros and ones. Each column is associated with 
a feature. The (n, k) entry in the matrix is one if index n is in feature k and 
zero otherwise. However, just as we do not know the ordering of clusters in a
clustering a priori (contrary to the classification case), we do not a priori
know the ordering of features in a feature allocation. To make use of a matrix 
representation for a feature allocation, we will need to introduce or find such an 
order. 

The reasoning above suggests that introducing an order for features in a
feature allocation would be useful. The next example illustrates that the
probability $\mathbb{P}(F_N = f_N)$ in some sense undercounts features when they contain
exactly the same indices: e.g., $A_j = A_k$ for some $j \neq k$. This fact will suggest
to us that it is not merely useful, but indeed a key point of our theoretical
development, to introduce an ordering on features.

Example 1 (A Bernoulli, two-feature allocation). Given $q_A, q_B \in (0, 1)$, draw
$Z_{n,A} \stackrel{iid}{\sim} \mathrm{Bern}(q_A)$ and $Z_{n,B} \stackrel{iid}{\sim} \mathrm{Bern}(q_B)$, independently, and construct the
random feature allocation by collecting those indices with successful draws:

$$F_N := \{\{n : n \le N, Z_{n,A} = 1\}, \{n : n \le N, Z_{n,B} = 1\}\}.$$

One caveat here is that if either of the two sets in the multiset $F_N$ is empty,
we do not include it in the allocation. Note that calling the features A and B
was merely for the purposes of construction, and in defining $F_N$, we have lost
all feature labels. So $F_N$ is a feature allocation, not an a priori labeled feature
allocation.

Then the probability of the feature allocation $F_5 = f_5 := \{\{2, 3\}, \{2, 3\}\}$ is

$$q_A^2 (1 - q_A)^3 q_B^2 (1 - q_B)^3,$$

but the probability of the feature allocation $F_5 = f'_5 := \{\{2, 3\}, \{2, 5\}\}$ is

$$2 q_A^2 (1 - q_A)^3 q_B^2 (1 - q_B)^3.$$

The difference is that in the latter case the features can be distinguished, and so
we must account for the two possible pairings of features to frequencies $\{q_A, q_B\}$.
Now, instead, let $\tilde{F}_N$ be $F_N$ with the features ordered uniformly at random
amongst all possible feature orderings. There is just a single possible ordering
of $f_5$, so the probability of $\tilde{F}_5 = \tilde{f}_5 := (\{2, 3\}, \{2, 3\})$ is again

$$q_A^2 (1 - q_A)^3 q_B^2 (1 - q_B)^3.$$

However, there are two orderings of $f'_5$, each of which is equally likely. The
probability of $\tilde{F}_5 = \tilde{f}'_5 := (\{2, 5\}, \{2, 3\})$ is

$$q_A^2 (1 - q_A)^3 q_B^2 (1 - q_B)^3.$$

The same holds for the other ordering. ■
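
The factor of two in Example 1 can be checked numerically by brute force. The sketch below enumerates every outcome of the Bernoulli draws for N = 5 and sums the probability of those outcomes that produce a given unordered allocation; the function name `allocation_probability` and the particular values of `qA` and `qB` are assumptions chosen only for illustration.

```python
from itertools import product
from collections import Counter

def allocation_probability(target, N, qA, qB):
    """P(F_N = target) for the two-feature Bernoulli construction, by summing
    over all 2^N x 2^N outcomes of (Z_{n,A}) and (Z_{n,B})."""
    target = Counter(frozenset(A) for A in target)
    total = 0.0
    for zA in product([0, 1], repeat=N):
        for zB in product([0, 1], repeat=N):
            A = frozenset(n + 1 for n in range(N) if zA[n])
            B = frozenset(n + 1 for n in range(N) if zB[n])
            # Empty features are dropped; the labels A and B are forgotten.
            if Counter(S for S in (A, B) if S) == target:
                pA = qA ** sum(zA) * (1 - qA) ** (N - sum(zA))
                pB = qB ** sum(zB) * (1 - qB) ** (N - sum(zB))
                total += pA * pB
    return total

qA, qB = 0.3, 0.6   # arbitrary illustrative values
base = qA**2 * (1 - qA)**3 * qB**2 * (1 - qB)**3
p_same = allocation_probability([{2, 3}, {2, 3}], 5, qA, qB)   # identical features
p_diff = allocation_probability([{2, 3}, {2, 5}], 5, qA, qB)   # distinguishable features
```

With these inputs, `p_same` agrees with `base` and `p_diff` agrees with `2 * base` up to floating-point error, matching the two displayed probabilities for the unordered allocations.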

This example suggests that there are combinatorial factors that must be
taken into account when working with the distribution of $F_N$ directly. The
example also suggests that we can avoid the need to specify such factors by
instead working with a suitable randomized ordering of the random feature
allocation $F_N$. We achieve this ordering in two steps.

The first step involves ordering the features via a procedure that we refer to
as order-of-appearance labeling. The basic idea is that we consider data indices
n = 1, 2, 3, and so on in order. Each time a new data point arrives, we examine
the features associated with that data point. Each time we see a new feature,
we label it with the lowest available feature label from k = 1, 2, ....

In practice, the order-of-appearance scheme requires some auxiliary
randomness since each index n may belong to zero, one, or many different features
(though the number must be finite). When multiple features first appear for
index n, we order them uniformly at random. That simple idea is explained
in full detail as follows. Recursively suppose that there are K features among
the indices [N − 1]. Trivially there are zero features when no indices have been
seen yet. Moreover, we suppose that we have features with labels 1 through
K if $K \geq 1$, and if K = 0, we have no features. If features remain without
labels, there exists some minimum index n in the data indices such that
$n \notin \bigcup_{k=1}^{K} A_k$, where the union is $\emptyset$ if K = 0. It is possible that no features
contain n. So we further note that there exists some minimum index m such
that $m \notin \bigcup_{j=1}^{K} A_j$ but m is contained in some feature of the allocation. By
construction, we must have $m \geq N$. Let $K_m$ be the number of features containing
m; $K_m$ is finite by definition of a feature allocation. Let $(U_k)$ denote a sequence
of iid uniform random variables, independent of the random feature allocation.
Assign $U_{K+1}, \ldots, U_{K+K_m}$ to these new features and determine their order of
appearance by the order of these random variables. While features remain to
be labeled, continue the recursion with N now equal to m and K now equal to
$K + K_m$.
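
The recursion above can be sketched in a few lines of Python for a finite allocation. This is an illustrative sketch rather than a definitive implementation: the function name `order_of_appearance` is hypothetical, and the tie-breaking uniforms are drawn on demand from a random number generator instead of being fixed in advance as a sequence $(U_k)$.

```python
import random

def order_of_appearance(features, rng=random):
    """Return the features as a list in order of appearance.

    Indices are scanned in increasing order; features that first appear at
    the same index are ordered by iid Uniform[0, 1] draws (the auxiliary
    randomness described in the text)."""
    remaining = [frozenset(A) for A in features]
    ordered = []
    for m in sorted(set().union(*remaining)):
        new = [A for A in remaining if m in A]
        # Tie-break the features first seen at index m with uniform draws.
        keyed = sorted((rng.random(), i) for i in range(len(new)))
        ordered.extend(new[i] for _, i in keyed)
        remaining = [A for A in remaining if m not in A]
    return ordered

f6 = [{2, 5, 4}, {3, 4}, {6, 4}, {3}, {3}]
ordered = order_of_appearance(f6, random.Random(0))
```

For this input, the feature {2, 4, 5} always receives label 1 and {4, 6} always receives label 5; only the relative order of the three features that first appear at index 3 depends on the random draws.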

Example 2 (Feature labeling schemes). Consider the feature allocation

$$f_6 = \{\{2, 5, 4\}, \{3, 4\}, \{6, 4\}, \{3\}, \{3\}\}. \tag{1}$$

And consider the random variables

$$U_1, U_2, U_3, U_4, U_5 \stackrel{iid}{\sim} \mathrm{Unif}[0, 1].$$

We see from $f_6$ that index 1 has no features. Index 2 has exactly one feature, so
we assign this feature, {2, 5, 4}, to have order-of-appearance label 1. While $U_1$
is associated with this feature, we do not need to break any ties at this point,
so it has no effect.

Index 3 is associated with three features. We associate each feature with
exactly one of $U_2$, $U_3$, and $U_4$ (the next three available $U_k$). For instance, pair
{3, 4} with $U_2$, {3} with $U_3$, and the other {3} with $U_4$. Suppose it happens
that $U_3 < U_2 < U_4$. Then the feature {3} paired with $U_3$ receives label 2 (the
next available order-of-appearance label). The feature {3, 4} receives label 3.
And the feature {3} paired with $U_4$ receives label 4.

Index 4 has three features, but {2, 5, 4} and {3, 4} are already labeled. So the
only remaining feature, {6, 4}, receives the next available order-of-appearance
label: 5. $U_5$ is associated with this feature, but since we do not need to break
ties here, it has no effect. Indices 5 and 6 belong to already-labeled features.
So the features can be listed with order-of-appearance indices as

$$A_1 = \{2, 5, 4\},\ A_2 = \{3\},\ A_3 = \{3, 4\},\ A_4 = \{3\},\ A_5 = \{6, 4\}. \tag{2}$$

Let $Y_n^o$ indicate the set of order-of-appearance feature labels for the features
to which index n belongs; i.e., if the features are labeled according to order of
appearance as in Eq. (2), then $Y_n^o = \{k : n \in A_k\}$. By definition of a feature
allocation, $Y_n^o$ must have finite cardinality. The order-of-appearance labeling
gives $Y_1^o = \emptyset$, $Y_2^o = \{1\}$, $Y_3^o = \{2, 3, 4\}$, $Y_4^o = \{1, 3, 5\}$, $Y_5^o = \{1\}$, $Y_6^o = \{5\}$.

Order-of-appearance labeling is well-suited for matrix representations of
feature allocations. The rows of the matrix correspond to indices n and the columns
correspond to features with order-of-appearance labels k. The matrix
representation of the order-of-appearance labeling and resulting feature assignments
$(Y_n^o)$ for $n \in [6]$ is depicted in Figure 2. ■

Figure 2: Order-of-appearance binary matrix representations of the sequence of
feature allocations on [2], [3], [4], [5], and [6] found by restricting $f_6$ in Example 2.
Rows correspond to indices n, and columns correspond to order-of-appearance
feature labels k. A gray square indicates a 1 entry, and a white square indicates
a 0 entry. $Y_n^o$, the set of order-of-appearance feature assignments of index n, is
easily read off from the matrix as the set of columns with entry in row n equal
to 1.
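
The label sets and the binary matrices of Example 2 can be reproduced directly from the order-of-appearance listing of the features. The sketch below is illustrative only; the variable names and the helper `restricted_matrix` are hypothetical, and the feature order is taken as given rather than sampled.

```python
# Features of f_6 listed by order-of-appearance label, as in Example 2.
ordered = [{2, 5, 4}, {3}, {3, 4}, {3}, {6, 4}]
N = 6

# Y[n] is the set of labels k (1-based) of the features containing index n.
Y = {n: {k for k, A in enumerate(ordered, start=1) if n in A}
     for n in range(1, N + 1)}

# Binary matrix: row n, column k holds a 1 when index n is in feature k.
Z = [[1 if n in A else 0 for A in ordered] for n in range(1, N + 1)]

def restricted_matrix(Z, M):
    """Matrix for the restriction to [M]: keep the first M rows and drop
    the columns of features that have not yet appeared (all-zero columns)."""
    rows = Z[:M]
    keep = [k for k in range(len(Z[0])) if any(r[k] for r in rows)]
    return [[r[k] for k in keep] for r in rows]
```

Because the columns are in order of appearance, the columns kept by `restricted_matrix(Z, M)` always form an initial block, so the matrices for [2], [3], ..., [6] grow only by appending rows and columns.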

Note that when the feature allocation is a partition, there is exactly one 
feature containing any m, so this scheme reduces to the order-of-appearance 
scheme for cluster labeling. 

Consider an exchangeable feature allocation $F_\infty$. Give order-of-appearance
labels to the features of this allocation, and let $Y_n^o$ be the set of feature labels for
features containing n. So $Y_n^o$ is a random finite subset of $\mathbb{N}$. It can be thought
of as a simple point process on $\mathbb{N}$; a discussion of measurability of such processes
may be found in Kallenberg (2002, p. 178). Our process is even simpler than a
simple point process as it is globally finite rather than merely locally finite.

Note that $(Y_n^o)_{n=1}^{\infty}$ is not necessarily exchangeable. For instance, consider
again Example 1. If $Y_1^o$ is non-empty, $1 \in Y_1^o$ with probability one. If $Y_2^o$ is
non-empty, with positive probability it may not contain 1. To restore exchangeability
we extend an idea due to Aldous (1985) in the setting of random partitions,
associating to each feature a draw from a uniform random variable on [0, 1].
Drawing these random variables independently, we maintain consistency across
different values of N. We refer to these random variables as uniform random
feature labels.

Note that the use of a uniform distribution is for convenience; we simply
require that features receive distinct labels with probability one, so any other
continuous distribution would suffice. We also note that in a full-fledged model
based on random feature allocations these labels often play the role of
parameters and are used in defining the likelihood. For further discussion of such
constructions, see Broderick et al. (2012b).



Thus, let $(\phi_k)$ be a sequence of iid uniform random variables, independent
of both $(U_k)$ and $F_\infty$. Construct a new feature labeling by taking the feature
labeled k in the order-of-appearance labeling and now label it $\phi_k$. In this case,
let $Y_n^\dagger$ denote the set of feature labels for features to which n belongs. Call this
a uniform random labeling. $Y_n^\dagger$ can be thought of as a (globally finite) simple
point process on [0, 1]. Again, we refer the reader to Kallenberg (2002, p. 178)
for a discussion of measurability.

Example 3 (Feature labeling schemes (continued)). Again consider the feature
allocation

$$f_6 = \{\{2, 5, 4\}, \{3, 4\}, \{6, 4\}, \{3\}, \{3\}\}.$$

Now consider the random variables

$$U_1, U_2, U_3, U_4, U_5, \phi_1, \phi_2, \phi_3, \phi_4, \phi_5 \stackrel{iid}{\sim} \mathrm{Unif}[0, 1].$$

Recall from Example 2 that $U_1, \ldots, U_5$ gave us the order-of-appearance labeling
of the features. This labeling allowed us to index the features as in Eq. (2),
copied here:

$$A_1 = \{2, 5, 4\},\ A_2 = \{3\},\ A_3 = \{3, 4\},\ A_4 = \{3\},\ A_5 = \{6, 4\}. \tag{3}$$

With this order-of-appearance labeling in hand, we can assign a uniform
random label to each feature. In particular, we assign the uniform random label
$\phi_k$ to the feature with order-of-appearance label k: $A_1 = \{2, 5, 4\}$ gets label $\phi_1$,
$A_2 = \{3\}$ gets label $\phi_2$, $A_3 = \{3, 4\}$ gets label $\phi_3$, $A_4 = \{3\}$ gets label $\phi_4$, and
$A_5 = \{6, 4\}$ gets label $\phi_5$. Let $Y_n^\dagger$ indicate the set of uniform random feature
labels for the features to which index n belongs. The uniform random labeling
gives

$$Y_1^\dagger = \emptyset,\ Y_2^\dagger = \{\phi_1\},\ Y_3^\dagger = \{\phi_2, \phi_3, \phi_4\},\ Y_4^\dagger = \{\phi_1, \phi_3, \phi_5\},\ Y_5^\dagger = \{\phi_1\},\ Y_6^\dagger = \{\phi_5\}. \tag{4}$$



Lemma 4. Give the features of an exchangeable feature allocation uniform
random labels, and let $Y_n^\dagger$ be the set of feature labels for features containing
n. So $Y_n^\dagger$ is a random finite subset of [0, 1]. Then the sequence $(Y_n^\dagger)_{n=1}^{\infty}$ is
exchangeable.

Figure 3: An illustration of the uniform random feature labeling in Example 3.
The top rectangle is the unit interval. The uniform random labels are depicted
along the interval with vertical dotted lines at their locations. The indices [6]
are shown to the left. A black circle appears when an index occurs in the
feature with a given label. The matrix representations of this feature allocation
in Figure 4 can be recovered from this plot.



Proof. Note that $(Y_n^\dagger)_{n=1}^{\infty} = g((\phi_k)_k, (U_k)_k, F_\infty)$ for some measurable function
g. So, for any finite permutation $\sigma$, we have that

$$(Y_{\sigma(n)}^\dagger)_n = g((\phi_{\tau(k)})_k, (U_k)_k, \sigma(F_\infty)),$$

where $\tau$ is a finite permutation that is a function of p, $(U_k)$, $\sigma$, and $F_\infty$. Now

$$((\phi_{\tau(k)})_k, (U_k)_k, \sigma(F_\infty)) \stackrel{d}{=} ((\phi_k)_k, (U_k)_k, \sigma(F_\infty))$$

since the iid sequence $(\phi_k)_k$, the iid sequence $(U_k)_k$, and $F_\infty$ are independent
by construction and

$$((\phi_k)_k, (U_k)_k, \sigma(F_\infty)) \stackrel{d}{=} ((\phi_k)_k, (U_k)_k, F_\infty)$$

since the feature allocation is exchangeable and the independence used above
still holds. So

$$g((\phi_{\tau(k)})_k, (U_k)_k, \sigma(F_\infty)) \stackrel{d}{=} g((\phi_k)_k, (U_k)_k, F_\infty).$$

It follows that the sequence $(Y_n^\dagger)_n$ is exchangeable. □

We can recover the full feature allocation from the sequence $(Y_n^\dagger)$.
In particular, if $\{x_1, x_2, \ldots\}$ are the unique values in $\{Y_1^\dagger, Y_2^\dagger, \ldots\}$, then the
features are $\{\{n : x_k \in Y_n^\dagger\} : k = 1, 2, \ldots\}$. The feature allocation can similarly
be recovered from the order-of-appearance label collections $(Y_n^o)$.

We can also recover a new random ordered feature allocation $\tilde{F}_N$ from the
sequence $(Y_n^\dagger)$. In particular, $\tilde{F}_N$ is the sequence, rather than the collection, of
features $\{n : x_k \in Y_n^\dagger\}$ such that the feature with smallest label $\phi_k$ occurs first,
and so on. This construction achieves our goal of avoiding the combinatorial
factors needed to work with the distribution of $F_N$, while retaining exchangeability
and consistency.

Figure 4: The same consistent sequence of feature allocations as in Figure 2 but
now with the uniform random order of Example 5 instead of the order of
appearance illustrated in Figure 2.



Example 5 (Feature labeling schemes (continued)). Once more, consider the
feature allocation

$$f_6 = \{\{2, 5, 4\}, \{3, 4\}, \{6, 4\}, \{3\}, \{3\}\}$$

and the uniform random labeling in Eq. (4). If it happens that
$\phi_4 < \phi_5 < \phi_2 < \phi_1 < \phi_3$, then the random ordered feature allocation is

$$\tilde{f}_6 = (\{3\}, \{6, 4\}, \{3\}, \{2, 5, 4\}, \{3, 4\}).$$

■
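
The passage from order-of-appearance labels to uniform random labels, and then to a random ordered feature allocation, can be sketched as follows. The function names `label_sets` and `random_ordered_allocation` are hypothetical, and the numeric labels in `phi_example` are made-up values chosen only so that $\phi_4 < \phi_5 < \phi_2 < \phi_1 < \phi_3$, as in Example 5.

```python
import random

def label_sets(ordered_features, phi, N):
    """Y[n] is the set of labels phi_k of the features containing index n,
    where phi[k] labels the k-th feature in order of appearance."""
    return {n: {phi[k] for k, A in enumerate(ordered_features) if n in A}
            for n in range(1, N + 1)}

def random_ordered_allocation(Y):
    """Recover the features from the label sets and list them by
    increasing label, smallest label first."""
    labels = sorted(set().union(*Y.values()))
    return [frozenset(n for n, Yn in Y.items() if x in Yn) for x in labels]

# Features in order of appearance: A_1, ..., A_5 of the running example.
ordered = [{2, 5, 4}, {3}, {3, 4}, {3}, {6, 4}]

# Hypothetical label values with phi_4 < phi_5 < phi_2 < phi_1 < phi_3.
phi_example = [0.7, 0.5, 0.9, 0.1, 0.3]
tilde_f6 = random_ordered_allocation(label_sets(ordered, phi_example, 6))

# A fresh draw of iid Uniform[0, 1] labels gives another random ordering.
rng = random.Random(1)
phi_random = [rng.random() for _ in ordered]
tilde_f6_random = random_ordered_allocation(label_sets(ordered, phi_random, 6))
```

With `phi_example`, the recovered sequence is ({3}, {4, 6}, {3}, {2, 4, 5}, {3, 4}), which matches the ordering stated in Example 5. The recovery step relies on the labels being distinct, which holds with probability one for continuous draws.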

Recall that we were motivated by Example 1 to produce such a random
ordering scheme to avoid obfuscating combinatorial factors in the probability
of a feature allocation. From another perspective, these factors arise because
the random labeling is in some sense more natural than alternative labelings;
again, consider random labels as iid parameters for each feature. While
order-of-appearance labeling is common due to its pleasant aesthetic representation in
matrix form (compare Figures 2 and 4), one must be careful to remember that
the resulting label sets $(Y_n^o)$ are not exchangeable. We will use random labeling
extensively below since, among other nice properties, it preserves exchangeability
of the sets of feature labels associated with the indices.



4 Exchangeable feature probability function 

In general, given a probability of a random feature allocation, $\mathbb{P}(F_N = f_N)$, we
can find the probability of a random ordered feature allocation $\mathbb{P}(\tilde{F}_N = \tilde{f}_N)$ as
follows. Let H be the number of distinct features of $F_N$, and let $(\tilde{K}_1, \ldots, \tilde{K}_H)$
be the multiplicities of these distinct features in decreasing order. Then

$$\mathbb{P}(\tilde{F}_N = \tilde{f}_N) = \binom{K}{\tilde{K}_1, \ldots, \tilde{K}_H}^{-1} \mathbb{P}(F_N = f_N), \tag{5}$$

where K is the number of features of $f_N$ and

$$\binom{K}{\tilde{K}_1, \ldots, \tilde{K}_H} := \frac{K!}{\tilde{K}_1! \cdots \tilde{K}_H!}.$$

For partitions, the effect of this multiplicative factor is the same across all
partitions with the same number of clusters; for some number of clusters K,
it is just 1/K!. In the general feature case, the multiplicative factor may be
different for different feature configurations with the same number of features.
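
The multiplicative factor relating ordered and unordered probabilities is the reciprocal of the number of distinct orderings of the multiset of features. A minimal sketch of that count, with the hypothetical function name `ordering_count`, is:

```python
from collections import Counter
from math import factorial

def ordering_count(features):
    """Number of distinct orderings of a feature allocation:
    K! / (K_1! ... K_H!), where K_1, ..., K_H are the multiplicities of the
    H distinct features and K is their sum."""
    multiplicities = Counter(frozenset(A) for A in features).values()
    count = factorial(sum(multiplicities))
    for m in multiplicities:
        count //= factorial(m)
    return count

same = ordering_count([{2, 3}, {2, 3}])                    # identical features
diff = ordering_count([{2, 3}, {2, 5}])                    # distinct features
f6_count = ordering_count([{2, 5, 4}, {3, 4}, {6, 4}, {3}, {3}])
partition_count = ordering_count([{1, 5}, {2, 3}, {4, 6}])
```

Dividing the unordered probability by this count gives the probability of any one particular ordering. For the two allocations of Example 1 the counts are 1 and 2, and for a partition with K blocks the count is K!, since the blocks of a partition are always distinct.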

Example 6 (A Bernoulli, two-feature allocation (continued)). Consider $F_N$
constructed as in Example 1. Denote the sizes of the two features by $M_{N,1}$ and
$M_{N,2}$. Then

$$\mathbb{P}(\tilde{F}_N = \tilde{f}_N) = \tfrac{1}{2} q_A^{M_{N,1}} (1 - q_A)^{N - M_{N,1}} q_B^{M_{N,2}} (1 - q_B)^{N - M_{N,2}}
  + \tfrac{1}{2} q_A^{M_{N,2}} (1 - q_A)^{N - M_{N,2}} q_B^{M_{N,1}} (1 - q_B)^{N - M_{N,1}}
  = p(N, M_{N,1}, M_{N,2}). \tag{6}$$

Here, p is some function of the number of indices N and the feature sizes
$(M_{N,1}, M_{N,2})$ that we note is symmetric in $(M_{N,1}, M_{N,2})$; i.e.,
$p(N, M_{N,1}, M_{N,2}) = p(N, M_{N,2}, M_{N,1})$. ■
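
As a hedged numerical illustration of the symmetry claim, the sketch below assumes that, when both features are non-empty, the ordered probability averages over the two equally likely pairings of the ordered features with the frequencies $q_A$ and $q_B$. The function name `efpf_two_feature` and the parameter values are hypothetical.

```python
def efpf_two_feature(N, m1, m2, qA, qB):
    """Probability of an ordered allocation with two non-empty features of
    sizes (m1, m2), assuming each pairing of the ordered features with the
    frequencies (qA, qB) has probability 1/2."""
    def term(a, b):
        # Feature of size a generated with frequency qA, size b with qB.
        return qA**a * (1 - qA)**(N - a) * qB**b * (1 - qB)**(N - b)
    return 0.5 * term(m1, m2) + 0.5 * term(m2, m1)

qA, qB = 0.3, 0.6   # arbitrary illustrative values
p_22 = efpf_two_feature(5, 2, 2, qA, qB)
p_13 = efpf_two_feature(5, 1, 3, qA, qB)
p_31 = efpf_two_feature(5, 3, 1, qA, qB)
```

Under this assumption the function depends on the features only through their sizes and is unchanged when the two sizes are swapped. For N = 5 and sizes (2, 2) it reduces to $q_A^2 (1 - q_A)^3 q_B^2 (1 - q_B)^3$, in agreement with the ordered probabilities computed in Example 1.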

When the feature allocation probability admits the representation

$$\mathbb{P}(\tilde{F}_N = \tilde{f}_N) = p(N, |A_1|, \ldots, |A_K|) \tag{7}$$

for every ordered feature allocation $\tilde{f}_N = (A_1, \ldots, A_K)$ and some function p
that is symmetric in all arguments after the first, we call p the exchangeable
feature probability function (EFPF). We take care to note that the exchangeable
partition probability function (EPPF), which always exists for partitions, is not
a special case of the EFPF. Indeed, the EPPF assigns zero probability to any
multiset in which an index occurs in more than one feature of the multiset; e.g.,
$\{\{1\}, \{2\}\}$ is a valid partition and a valid feature allocation of [2], whereas
$\{\{1\}, \{1\}\}$ is a valid feature allocation but not a valid partition of [2]. Thus, the
EPPF must examine the feature indices of a feature allocation to judge their
exclusivity and thereby assign a probability. By contrast, the indices in the
multiset provide no such information to the EFPF; only the sizes of the multiset
features are relevant in the EFPF case.

Proposition 7. The class of exchangeable feature allocations with EFPFs is a 
strict but non-empty subclass of the class of exchangeable feature allocations. 

Proof. Example 8 below shows that the class of feature allocations with EFPFs
is non-empty, and Example 9 below establishes that there exist simple
exchangeable feature allocations without EFPFs. □

Example 8 (Three-parameter Indian buffet process). The Indian buffet process
(IBP) (Griffiths and Ghahramani, 2006) is a generative model for a random
feature allocation that is specified recursively in a manner akin to the Chinese
restaurant process (Aldous, 1985) in the case of partitions. The metaphor
involves a set of "customers" that enter a restaurant and sample a set of "dishes."
Order the customers by placing them in one-to-one correspondence with the
indices $n \in \mathbb{N}$. The dishes in the restaurant correspond to feature labels.
Customers in the Indian buffet can sample any non-negative integer number of
dishes. The set of dishes chosen by a customer n is just $Y_n^o$, the collection of
feature labels for the features to which n belongs, and the procedure described
below provides a way to construct $Y_n^o$ recursively.

Figure 5: Illustration of an Indian buffet process in the order-of-appearance
representation of Figure 2. The buffet (top) consists of a vector of dishes,
corresponding to features. Each customer (corresponding to a data point) who
enters the restaurant first decides whether or not to choose dishes that the other
customers have already sampled. The customer then selects a random number
of new dishes, not previously sampled by any customer. A gray box in position
(n, k) indicates customer n has sampled dish k, and a white box indicates the
customer has not sampled the dish. In the example, the second customer has
sampled exactly those dishes indexed by 2, 4, and 5: $Y_2^o = \{2, 4, 5\}$.

We describe an extended version (Teh and Görür, 2009; Broderick et al.,
2012a) of the Indian buffet that includes two extra parameters beyond the single
mass parameter $\gamma$ ($\gamma > 0$) originally specified by Griffiths and Ghahramani
(2006): in particular, we include a concentration parameter $\theta$ ($\theta > 0$) and a
discount parameter $\alpha$ ($\alpha \in [0, 1)$). We abbreviate this three-parameter IBP
as "3IBP." The single-parameter IBP may be recovered by setting $\theta = 1$ and
$\alpha = 0$.

We start with a single customer, who enters the buffet and chooses
$K_1^{+} \sim \mathrm{Poisson}(\gamma)$ dishes. None of the dishes have been sampled by any other
customers since no other customers have yet entered the restaurant. An
order-of-appearance labeling gives the dishes labels $1, \ldots, K_1^{+}$ if $K_1^{+} > 0$.

Recursively, the $n$th customer chooses which dishes to sample in two phases.
First, for each dish $k$ that has previously been sampled by any customer in
$1, \ldots, n - 1$, customer $n$ samples dish $k$ with probability

\[ \frac{M_{n-1,k} - \alpha}{\theta + n - 1}, \]

for $M_{n,k}$ equal to the number of customers indexed $1, \ldots, n$ who have tried
dish $k$. As each dish represents a feature, sampling a dish represents that the
customer index $n$ belongs to that feature. And $M_{n,k}$ is the size of the feature
labeled $k$ in the feature allocation of $[n]$.
Next, customer $n$ chooses

\[ K_n^{+} \sim \mathrm{Poisson}\left( \gamma \, \frac{\Gamma(\theta + 1)}{\Gamma(\theta + n)} \cdot \frac{\Gamma(\theta + \alpha - 1 + n)}{\Gamma(\theta + \alpha)} \right) \]

new dishes to try. If $K_n^{+} > 0$, then the dishes receive unique order-of-appearance
labels $K_{n-1} + 1, \ldots, K_n$. Here, $K_n$ represents the number of sampled dishes after
$n$ customers: $K_n = K_{n-1} + K_n^{+}$ (with base case $K_0 = 0$).
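
The two-phase recursion can be sketched in code. The following is a minimal illustration, not an implementation from the paper: it assumes the revisit probability $(M_{n-1,k} - \alpha)/(\theta + n - 1)$ and the Poisson rate $\gamma \, \Gamma(\theta+1)\Gamma(\theta+\alpha-1+n) / (\Gamma(\theta+n)\Gamma(\theta+\alpha))$, and the names `new_dish_rate`, `sample_poisson`, and `sample_3ibp` are hypothetical helpers introduced only for this sketch.

```python
import math
import random


def new_dish_rate(n, gamma, theta, alpha):
    """Poisson rate for the number of new dishes tried by customer n (1-indexed)."""
    log_rate = (math.log(gamma)
                + math.lgamma(theta + 1) - math.lgamma(theta + n)
                + math.lgamma(theta + alpha - 1 + n) - math.lgamma(theta + alpha))
    return math.exp(log_rate)


def sample_poisson(rate, rng):
    """Knuth's multiplication method; adequate for the small rates used here."""
    threshold = math.exp(-rate)
    count, running_product = 0, rng.random()
    while running_product > threshold:
        count += 1
        running_product *= rng.random()
    return count


def sample_3ibp(num_customers, gamma, theta, alpha, seed=0):
    """Return the features (sets of customer indices) in order of appearance."""
    rng = random.Random(seed)
    features = []  # features[k - 1] holds the customers who sampled dish k
    for n in range(1, num_customers + 1):
        # Phase 1: revisit each dish sampled by an earlier customer.
        # len(members) equals M_{n-1,k} because customer n is not yet included.
        for members in features:
            if rng.random() < (len(members) - alpha) / (theta + n - 1):
                members.add(n)
        # Phase 2: a Poisson number of new dishes, labeled in order of appearance.
        for _ in range(sample_poisson(new_dish_rate(n, gamma, theta, alpha), rng)):
            features.append({n})
    return features
```

Calling `sample_3ibp(10, gamma=2.0, theta=1.0, alpha=0.5)` returns a list whose $k$th entry is the feature with order-of-appearance label $k$; setting `theta=1.0` and `alpha=0.0` reduces the rate for customer $n$ to $\gamma / n$, the single-parameter IBP.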

With this generative model in hand, we can find the probability of a particular
feature allocation. We discover its form by enumeration. At each round $n$,
we have a Poisson number of new features, $K_n^{+}$, represented. The probability
factor associated with these choices is a product of Poisson densities:

\[ \prod_{n=1}^{N} \frac{1}{K_n^{+}!} \left[ C(n, \gamma, \theta, \alpha) \right]^{K_n^{+}} \exp\left( -C(n, \gamma, \theta, \alpha) \right), \]

where

\[ C(n, \gamma, \theta, \alpha) := \gamma \, \frac{\Gamma(\theta + 1)}{\Gamma(\theta + n)} \cdot \frac{\Gamma(\theta + \alpha - 1 + n)}{\Gamma(\theta + \alpha)}. \]

Let $R_k$ be the round on which the $k$th dish, in order of appearance, is
first chosen. Then the denominators for future dish choice probabilities are the
factors in the product $(\theta + R_k) \cdot (\theta + R_k + 1) \cdots (\theta + N - 1)$. The numerators
for the times when the dish is chosen are the factors in the product
$(1 - \alpha) \cdot (2 - \alpha) \cdots (M_{N,k} - 1 - \alpha)$. The numerators for the times when the dish is not
chosen yield $(\theta + R_k - 1 + \alpha) \cdots (\theta + N - 1 - M_{N,k} + \alpha)$. Let $A_{n,k}$ represent the
collection of indices in the feature with label $k$ after $n$ customers have entered
the restaurant. Then $M_{n,k} = |A_{n,k}|$.
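
As a numerical sanity check on this bookkeeping, the sketch below multiplies the per-round choice probabilities for a single dish and compares the result with the Gamma-function ratio that the three products collapse to. It assumes the choice probability $(M_{n-1,k} - \alpha)/(\theta + n - 1)$ from the generative description; the function names and the example history are hypothetical.

```python
import math


def dish_factor_by_rounds(first_round, choices, theta, alpha):
    """Multiply per-round probabilities for one dish introduced at round first_round.

    choices[i] is True when customer first_round + 1 + i samples the dish.
    """
    size = 1  # the customer at round first_round has sampled the dish
    prob = 1.0
    for offset, chosen in enumerate(choices):
        n = first_round + 1 + offset
        p_choose = (size - alpha) / (theta + n - 1)
        if chosen:
            prob *= p_choose
            size += 1
        else:
            prob *= 1.0 - p_choose
    return prob


def dish_factor_closed_form(first_round, num_customers, size, theta, alpha):
    """Chosen numerators times not-chosen numerators divided by denominators."""
    log_value = (
        # (1 - alpha) ... (M - 1 - alpha)
        math.lgamma(size - alpha) - math.lgamma(1 - alpha)
        # (theta + R - 1 + alpha) ... (theta + N - 1 - M + alpha)
        + math.lgamma(theta + num_customers - size + alpha)
        - math.lgamma(theta + first_round - 1 + alpha)
        # reciprocal of (theta + R) ... (theta + N - 1)
        + math.lgamma(theta + first_round) - math.lgamma(theta + num_customers)
    )
    return math.exp(log_value)
```

For example, with `first_round=2` and `choices=[True, False, True, False, False]` (so $N = 7$ and $M_{N,k} = 3$), both functions return the same value, and reordering the entries of `choices` leaves it unchanged, which reflects the fact that only the feature size matters.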

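Finally, let $\tilde{K}_1, \ldots, \tilde{K}_H$ be the multiplicities of distinct features formed by
this model. We note that there are

\[ \left[ \prod_{n=1}^{N} K_n^{+}! \right] \Big/ \left[ \prod_{h=1}^{H} \tilde{K}_h! \right] \]

rearrangements of the features generated by this process that all yield the same
feature allocation. Since they all have the same generating probability, we
simply multiply by this factor to find the feature allocation probability.
Multiplying all factors together and taking $f_N = \{A_{N,1}, \ldots, A_{N,K_N}\}$ yields

\[ P(F_N = f_N) = \left( \prod_{h=1}^{H} \tilde{K}_h! \right)^{-1} \left( \gamma \, \frac{\Gamma(\theta + 1)}{\Gamma(\theta + \alpha)} \right)^{K_N} \exp\left( -\sum_{n=1}^{N} C(n, \gamma, \theta, \alpha) \right) \prod_{k=1}^{K_N} \left[ \frac{\Gamma(M_{N,k} - \alpha)}{\Gamma(1 - \alpha)} \cdot \frac{\Gamma(\theta + N - M_{N,k} + \alpha)}{\Gamma(\theta + N)} \right]. \]

(Readers curious about how the $R_k$ terms disappear may observe that, for each
dish $k$, the factor $\Gamma(\theta + \alpha - 1 + R_k) / \Gamma(\theta + R_k)$ contributed by $C(R_k, \gamma, \theta, \alpha)$
cancels against the $R_k$-dependent parts of the denominator product,
$\Gamma(\theta + N) / \Gamma(\theta + R_k)$, and of the not-chosen numerator product,
$\Gamma(\theta + N - M_{N,k} + \alpha) / \Gamma(\theta + R_k - 1 + \alpha)$.)

It follows from Eq. (5) that the probability of a uniform random ordering of
the feature allocation is

\[ P(\tilde{F}_N = \tilde{f}_N) = \frac{1}{K_N!} \left( \gamma \, \frac{\Gamma(\theta + 1)}{\Gamma(\theta + \alpha)} \right)^{K_N} \exp\left( -\sum_{n=1}^{N} C(n, \gamma, \theta, \alpha) \right) \prod_{k=1}^{K_N} \left[ \frac{\Gamma(M_{N,k} - \alpha)}{\Gamma(1 - \alpha)} \cdot \frac{\Gamma(\theta + N - M_{N,k} + \alpha)}{\Gamma(\theta + N)} \right]. \quad (8) \]

The distribution of $\tilde{F}_N$ has no dependence on the ordering of the indices in
$[N]$. Hence, the distribution of $F_N$ depends only on the same quantities (the
number of indices and the feature sizes) and the feature multiplicities. So we
see that the 3IBP construction yields an exchangeable random feature allocation.
Consistency follows from the recursive construction and exchangeability.
Therefore, Eq. (8) is seen to be in EFPF form given by Eq. (7). ■

The three-parameter Indian buffet process has an EFPF representation, but
the following simple model does not.

Example 9 (A general two-feature allocation). We here describe an exchangeable,
consistent random feature allocation whose (ordered) distribution does
not depend only on the number of indices $N$ and the sizes of the features of the
allocation.

Let $p_{10}, p_{01}, p_{11}, p_{00}$ be fixed frequencies that sum to one. Let $Y_n$ represent
the collection of features to which index $n$ belongs. For $n \in \{1, 2\}$, choose $Y_n$
independently and identically according to:

\[ Y_n = \begin{cases} \{1\} & \text{with probability } p_{10} \\ \{2\} & \text{with probability } p_{01} \\ \{1, 2\} & \text{with probability } p_{11} \\ \emptyset & \text{with probability } p_{00}. \end{cases} \]
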

We form a feature allocation from these labels as follows. For each label (1 or
2), collect those indices $n$ with the given label appearing in $Y_n$ to form a feature.
Now consider two possible outcome feature allocations: $f_2 = \{\{2\}, \{2\}\}$, and
$f_2' = \{\{1\}, \{2\}\}$. The probability of any ordering $\tilde{f}_2$ of $f_2$ under this model is

\[ P(\tilde{F}_2 = \tilde{f}_2) = p_{10}^{0} \, p_{01}^{0} \, p_{11}^{1} \, p_{00}^{1}. \]

The probability of any ordering $\tilde{f}_2'$ of $f_2'$ is

\[ P(\tilde{F}_2 = \tilde{f}_2') = p_{10}^{1} \, p_{01}^{1} \, p_{11}^{0} \, p_{00}^{0}. \]

It follows from these two probabilities that we can choose values of $p_{10}, p_{01}, p_{11}, p_{00}$
such that $P(\tilde{F}_2 = \tilde{f}_2) \neq P(\tilde{F}_2 = \tilde{f}_2')$. But $\tilde{f}_2$ and $\tilde{f}_2'$ have the same feature counts
and $N$ value ($N = 2$). So there can be no such symmetric function $p$, as in
Eq. (6), for this model. ■
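
The two probabilities in this example can be checked by brute-force enumeration. The sketch below is illustrative only: it assumes fixed (non-random) frequencies, enumerates every pair $(Y_1, Y_2)$, and spreads each outcome's probability uniformly over the orderings of its features. The function name and the particular frequencies in the usage note are hypothetical choices.

```python
from collections import defaultdict
from fractions import Fraction
from itertools import permutations, product


def ordered_allocation_probabilities(p10, p01, p11, p00, num_indices=2):
    """Map each ordered feature allocation (a tuple of frozensets) to its probability."""
    label_probs = {
        frozenset({1}): p10,
        frozenset({2}): p01,
        frozenset({1, 2}): p11,
        frozenset(): p00,
    }
    result = defaultdict(Fraction)
    for labels in product(label_probs, repeat=num_indices):
        prob = Fraction(1)
        for y in labels:
            prob *= label_probs[y]
        # Collect, for each label, the indices whose label set contains it.
        features = []
        for label in (1, 2):
            members = frozenset(
                n for n, y in enumerate(labels, start=1) if label in y
            )
            if members:
                features.append(members)
        # Uniform random ordering of the (possibly repeated) features.
        orderings = list(permutations(features))
        for ordering in orderings:
            result[ordering] += prob / len(orderings)
    return dict(result)
```

With `p10, p01, p11, p00 = Fraction(1, 2), Fraction(1, 4), Fraction(1, 8), Fraction(1, 8)`, the ordered allocation `({2}, {2})` receives probability $p_{11} p_{00} = 1/64$ while `({1}, {2})` receives $p_{10} p_{01} = 1/8$, even though both have two features of size one.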



5 The Kingman paintbox and feature paintbox 

Since the class of exchangeable feature models with EFPFs is a strict subclass of 
the class of exchangeable feature models, it remains to find a characterization 
of the latter class. Noting that the sequence of feature collections is an 
exchangeable sequence when the uniform random labeling of features is used, 
we might turn to the de Finetti mixing measure of this exchangeable sequence 
for such a characterization. 

Indeed, in the partition case, the Kingman paintbox (Kingman, 1978; Aldous,
1985) provides just such a characterization.



Theorem 10 (Kingman paintbox). Let $\Pi_\infty := (\Pi_n)_{n=1}^{\infty}$ be an exchangeable
random partition of $\mathbb{N}$, and let $(M^{\downarrow}_{n,k}, k \geq 1)$ be the decreasing rearrangement of
cluster sizes of $\Pi_n$ with $M^{\downarrow}_{n,k} = 0$ if $\Pi_n$ has fewer than $k$ clusters. Then $M^{\downarrow}_{n,k} / n$
has an almost sure limit $\rho^{\downarrow}_k$ as $n \to \infty$ for each $k$. Moreover, the conditional
distribution of $\Pi_\infty$ given $(\rho^{\downarrow}_k, k \geq 1)$ is as if $\Pi_\infty$ were generated by random
sampling from a random distribution with ranked atoms $(\rho^{\downarrow}_k, k \geq 1)$.

When the partition clusters are labeled with uniform random labels rather
than by the ranking in the statement of the theorem above, Kingman's paintbox
provides the de Finetti mixing measure for the sequence of partition labels
of each index $n$. Two representations of an example Kingman paintbox are
illustrated in Figure 6. The Kingman paintbox is so named since we imagine
each subinterval of the unit interval as containing paint of a certain color; the
colors have a one-to-one mapping with the uniform random cluster labels. A
random draw from the unit interval is painted with the color of the Kingman
paintbox subinterval into which it falls. While Figure 6 depicts just four
subintervals and hence at most four clusters, the Kingman paintbox may in general
have a countable number of subintervals and hence clusters. Moreover, these
subintervals may themselves be random.

Figure 6: Left: An example Kingman paintbox. The upper rectangle repre- 
sents the unit interval. The lower rectangles represent a partition of the unit 
interval into four subintervals corresponding to four clusters. The horizontal 
locations of the seven vertical lines represent seven uniform random draws from 
the unit interval. The resulting partition of [7] is {{3, 5}, {7, 1, 2}, {6}, {4}}. 
Right: An alternate representation of the same Kingman paintbox, now with 
each subinterval separated out into its own vertical level. To the right of each 
cluster subinterval is a uniform random label (with index determined by order 
of appearance) for the cluster.

Note that the ranked atoms need not sum to one; in general, $\sum_k \rho^{\downarrow}_k \leq 1$.
When random sampling from the Kingman paintbox does not select some atom $k$
with $\rho^{\downarrow}_k > 0$, a new cluster is formed but it is necessarily never selected again for
another index. In particular, then, a corollary of the Kingman paintbox theorem
is that there are two types of clusters: those with unbounded size as the number
of indices $N$ grows to infinity and those with exactly one member as $N$ grows
to infinity; the latter are sometimes referred to as singletons or collectively as
Kingman dust. In the feature case, we impose one further regularity condition
that essentially rules out dust. Consider any feature allocation $F_\infty$. Recall that
we use the notation $Y_n^{\dagger}$ to indicate the set of features to which index $n$ belongs.
We assume that, for each $n$, with probability one there exists some $m$ with
$m \neq n$ such that $Y_m^{\dagger} = Y_n^{\dagger}$. Equivalently, with probability one there is no index
with a unique feature collection. We call a random feature allocation that obeys
this condition a regular feature allocation.

We can prove the following theorem for the feature case, analogous to the 
Kingman paintbox construction for partitions. 

Theorem 11 (Feature paintbox). Let $F_\infty := (F_n)$ be an exchangeable,
consistent, regular random feature allocation of $\mathbb{N}$. There exists a random
sequence $(C_k)_{k=1}^{\infty}$ such that $C_k$ is a countable union of subintervals of $[0, 1]$ (and
may be empty) and such that $F_\infty$ has the same distribution as $F'_\infty$, where $F'_\infty$
is generated as follows. Randomly sample $(U'_n)_n$ iid uniform in $[0, 1]$. Let
$Y'_n := \{k : U'_n \in C_k\}$ represent a collection of feature labels for index $n$, and let
$F'_\infty$ be the induced feature allocation from these label collections.

Proof. Given $F_\infty$ as in the theorem statement, we can construct $(Y_n^{\dagger})_{n=1}^{\infty}$ as in
Lemma [?]. Then, according to Lemma [?], $(Y_n^{\dagger})_{n=1}^{\infty}$ is an exchangeable sequence.
Note that $(Y_n^{\dagger})$ defines a partition: $n \sim m$ (i.e., $n$ and $m$ belong to the same
cluster of the partition) if and only if $Y_n^{\dagger} = Y_m^{\dagger}$. This partition is exchangeable
since the feature allocation is. Moreover, since we assume there are no singletons
in the induced partition (by regularity), the Kingman paintbox theorem implies
that the Kingman paintbox atoms sum to one.

By de Finetti's theorem (Aldous, 1985), there exists $\alpha$ such that $\alpha$ is the
directing random measure for $(Y_n^{\dagger})$. Condition on $\alpha = \mu$. Write $\mu = \sum_{j=1}^{\infty} q_j \delta_{x_j}$,
where the $q_j$ satisfy $q_j \in (0, 1]$ and are written in monotone decreasing order:
$q_1 \geq q_2 \geq \cdots$. The condition that the atoms of the paintbox sum to one
translates to $\sum_{j=1}^{\infty} q_j = 1$. The $(x_j)$ are the (countable) unique values of $Y_n^{\dagger}$,
ordered to agree with the $q_j$. The strong law of large numbers yields

\[ N^{-1} \, \#\{n : n \leq N, \; Y_n^{\dagger} = x_j\} \to q_j, \quad N \to \infty. \]

Since $\sum_{j=1}^{\infty} q_j = 1$, we can partition the unit interval into subintervals of
length $q_j$. The $j$th such subinterval starts at $s_j := \sum_{l=1}^{j-1} q_l$ and ends at $e_j := s_{j+1}$.
For $k = 1, 2, \ldots$, define $C_k := \bigcup_{j : \phi_k \in x_j} [s_j, e_j)$. We call the $(C_k)_{k=1}^{\infty}$ the
feature paintbox.

Then $F_\infty$ has the same distribution as the following construction. Let
$(U'_1, U'_2, \ldots)$ be an iid sequence of uniform random variables. For each $n$, define
$Y'_n = \{k : U'_n \in C_k\}$ to be the collection of features, now labeled by positive
integers, to which $n$ belongs. Let $F'_\infty$ be the feature allocation induced by the
$(Y'_n)$. □
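
The sampling step of the feature paintbox construction can be sketched as follows. This is a minimal illustration under stated assumptions: each subset $C_k$ is represented as a finite list of half-open intervals, the paintbox passed in is a fixed hypothetical example rather than one derived from a particular model, and the function names are introduced only for this sketch.

```python
def features_of_draw(u, paintbox):
    """Return {k : u in C_k}, where paintbox[k - 1] lists the intervals of C_k."""
    return {
        k
        for k, intervals in enumerate(paintbox, start=1)
        if any(start <= u < end for start, end in intervals)
    }


def induced_allocation(draws, paintbox):
    """Build the feature allocation of [N] induced by the draws U'_1, ..., U'_N."""
    label_sets = [features_of_draw(u, paintbox) for u in draws]
    allocation = []
    for k in range(1, len(paintbox) + 1):
        members = {n for n, labels in enumerate(label_sets, start=1) if k in labels}
        if members:  # a feature appears only if some index falls in C_k
            allocation.append(members)
    return allocation
```

For instance, with the hypothetical paintbox `[[(0.0, 0.5)], [(0.25, 0.75)]]` and draws `[0.1, 0.3, 0.6, 0.9]`, the label collections are `{1}`, `{1, 2}`, `{2}`, and the empty set, so the induced allocation is `[{1, 2}, {2, 3}]`. In practice the draws would come from `random.random()`.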

A point to note about this feature paintbox construction is that the ordering
of the feature paintbox subsets $C_k$ in the proof is given by the order of
appearance of features in the original feature allocation $F_\infty$. This ordering stands in
contrast to the ordering of atoms by size in the Kingman paintbox. Making
use of such a size-ordering would be more difficult in the feature case due to
the non-trivial intersections of feature subsets. A particularly important
implication is that the conditional distribution of $F_\infty$ given $(C_k)_k$ is not the same
as that of $F'_\infty$ given $(C_k)_k$ (cf. Pitman (1995) for similar ordering issues in the
partition case).

An example feature paintbox is illustrated in Figure 7. Again, we may think
of each feature paintbox subset as containing paint of a certain color (where these 
colors have a one-to-one mapping with the uniform random labels) . Draws from 
the unit interval to determine the feature allocation may now be painted with 
some subset of these colors rather than just a single color. 

Next, we revisit earlier examples to find their feature paintbox representa- 
tions. 

Example 12 (A general two-feature allocation (continued)). The feature paintbox
for the random feature allocation in Example 9 consists of two features. The
total measure of the paintbox subset for feature 1 is $p_{10} + p_{11}$. The total
measure of the paintbox subset for feature 2 is $p_{01} + p_{11}$. The total measure of the
intersection of these two subsets is $p_{11}$. A depiction of this paintbox appears in
Figure 8. ■

Figure 7: An example feature paintbox. The top rectangle represents the unit
interval. Each vertical level below the top rectangle represents a subset of
the unit interval corresponding to a feature. To the right of each subset is
a uniform random label for the feature. For example, using the notation of
Theorem 11, the topmost subset is $C_2$ corresponding to feature label $\phi_2$. The
vertical dashed lines represent uniform random draws; i.e., $U'_n$ for index $n$.
The resulting feature allocation of $[7]$ for this realization of the construction is
$\{\{3, 5, 7, 1\}, \{5, 7\}, \{7, 1\}, \{6\}, \{6\}\}$. The collection of feature labels for index 7
is $Y_7^{\dagger} = \{\phi_2, \phi_3, \phi_1\}$. The collection of feature labels for index 4 is $Y_4^{\dagger} = \emptyset$.

Figure 8: A feature paintbox for the two-feature allocation in Example 9. The
top rectangle is the unit interval. The middle rectangle is the feature paintbox
subset for feature 1. The lower rectangle is the feature paintbox subset for
feature 2.



Example 13 (Three-parameter Indian buffet process (continued)). The 3IBP 
turns out to be an instance of a general class of exchangeable feature models that 
we refer to as feature frequency models. This class of models not only provides 
a straightforward way to construct feature paintbox representations in general, 
but also plays a key role in our general theory, providing a link between feature 
paintboxes and EFPFs. In the following section, we define feature frequency 
models, develop the general construction of paintboxes from feature frequency 
models, and then return to the construction of the feature paintbox for the 3IBP 
as an example. We subsequently turn to the general theoretical characterization 
of feature frequency models. ■



6 Feature frequency models 

We now discuss a general class of exchangeable feature models for which it
is straightforward to describe the feature paintbox. Let $(V_k)_{k=1}^{\infty}$ be a sequence
of (not necessarily independent) random variables with values in $[0, 1]$ such
that $\sum_{k=1}^{\infty} V_k < \infty$ almost surely. Let $\phi_k \overset{iid}{\sim} \mathrm{Unif}[0, 1]$ and independent of
the $(V_k)$. A feature frequency model is built around a random measure
$B = \sum_{k=1}^{\infty} V_k \delta_{\phi_k}$. We may draw a feature allocation given $B$ as follows. For each
data point $n$, independently draw its features like so: for each feature indexed
by $k$, independently make a Bernoulli draw with success probability $V_k$. If the
draw is a success, $n$ belongs to the feature indexed by $k$ (i.e., the feature with
label $\phi_k$). If the draw is a failure, $n$ does not belong to the feature indexed by
$k$. The feature allocation is induced in the usual way from these labels.

The condition that the frequencies have an almost surely finite sum guarantees,
by the Borel-Cantelli lemma, that the number of features exhibited by
any index $n$ is almost surely finite, as required in the definition of a feature
allocation. We obtain exchangeable feature allocations simply by virtue of the 
fact that the feature allocations are independently and identically distributed 
given B. The Bernoulli draws from the feature frequencies guarantee that the 
feature allocation is regular. 

Before constructing the feature paintbox for such a model, we note that $V_k$ is
the total length of the paintbox subset for the feature indexed by $k$. In this sense,
it is the frequency of this feature (hence the name "feature frequency model").
And $\phi_k$ is the uniform random feature label for the feature with frequency $V_k$.
Finally, to achieve the independent Bernoulli draws across $k$ required by the
feature allocation specification, we need for the intersection of any two paintbox
subsets to have length equal to the product of the two paintbox subset lengths.
This desideratum can be achieved with a recursive construction.

First, divide the unit interval into one subset (call it $I_1$) of length $V_1$ and
another subset (call it $I_0$) of length $1 - V_1$. Then $I_1$ is the paintbox subset for
the feature indexed by 1. Recursively, suppose we have paintbox subsets for
features indexed 1 to $K - 1$. Let $e$ be a binary string of length $K - 1$. Suppose
that $I_e$ is the intersection of (a) all paintbox subsets for features indexed by $k$
($k < K$) where the $k$th digit of $e$ is 1 and (b) all paintbox subset complements
for features indexed by $k$ ($k < K$) where the $k$th digit of $e$ is 0. For every $e$,
we construct $I_{(e,1)}$ to be a subset of $I_e$ with total length equal to $V_K$ times the
length of $I_e$. We construct $I_{(e,0)}$ to be $I_e \setminus I_{(e,1)}$.

Finally, the paintbox subset for the feature indexed by $K$ is the union of all
$I_{e'}$ with $e'$ a binary string of length $K$ such that the final digit of $e'$ is 1. An
example of such a paintbox is illustrated in Figure 9.

Figure 9: An example feature paintbox for a feature frequency model (Section 6).
One such model is the 3IBP (Example 14).
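
The recursive construction can be sketched for a finite list of fixed frequencies. The code below is an illustration under stated assumptions, not the paper's construction verbatim: it always takes $I_{(e,1)}$ to be the left-hand portion of $I_e$ (any subset of the right length would do), it uses exact rational arithmetic, and it enumerates all $2^K$ pieces, so it is only practical for small $K$. The function names are hypothetical.

```python
from fractions import Fraction


def build_paintbox(frequencies):
    """Return, for each feature k, the list of intervals making up its paintbox subset."""
    pieces = {(): (Fraction(0), Fraction(1))}  # binary string e -> interval I_e
    for v in frequencies:
        refined = {}
        for e, (start, end) in pieces.items():
            cut = start + v * (end - start)
            refined[e + (1,)] = (start, cut)  # I_(e,1): length v times |I_e|
            refined[e + (0,)] = (cut, end)    # I_(e,0): the rest of I_e
        pieces = refined
    # Feature k is the union of all pieces whose k-th digit is 1.
    return [
        [iv for e, iv in pieces.items() if e[k] == 1 and iv[0] < iv[1]]
        for k in range(len(frequencies))
    ]


def total_length(intervals):
    return sum(end - start for start, end in intervals)


def intersection_length(first, second):
    """Total overlap between two unions of disjoint intervals."""
    return sum(
        max(Fraction(0), min(e1, e2) - max(s1, s2))
        for s1, e1 in first
        for s2, e2 in second
    )
```

With frequencies `[Fraction(1, 2), Fraction(1, 3), Fraction(1, 4)]`, each subset has total length equal to its frequency and every pairwise intersection has length equal to the product of the two frequencies, which is the property needed for independent Bernoulli draws across features.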

Example 14 (Three-parameter Indian buffet process (continued)). We show
that the three-parameter Indian buffet process is an example of a feature
frequency model, and thus its feature paintbox can be constructed according to
the general recipe that we have just presented.

The underlying random measure for the three-parameter Indian buffet process
is known as the three-parameter beta process (Teh and Görür, 2009; Broderick et al.,
2012a). This random measure, denoted $B$, can be constructed explicitly via the
following recursion (with $K_0 = 0$):

\[ K_n^{+} \sim \mathrm{Poisson}\left( \gamma \, \frac{\Gamma(\theta + 1)}{\Gamma(\theta + n)} \cdot \frac{\Gamma(\theta + \alpha - 1 + n)}{\Gamma(\theta + \alpha)} \right) \]
\[ K_n = K_{n-1} + K_n^{+} \]
\[ V_k \sim \mathrm{Beta}(1 - \alpha, \; \theta + n - 1 + \alpha), \quad k = K_{n-1} + 1, \ldots, K_n \]
\[ \phi_k \sim \mathrm{Unif}[0, 1] \]
\[ B = \sum_{k=1}^{\infty} V_k \delta_{\phi_k}, \]

where we recall that the $\phi_k$ are assumed to be drawn from the uniform
distribution for simplicity in this paper, but in general they may be drawn from
a continuous distribution that serves as a prior for the parameters defining a
likelihood.

Given $B = \sum_{k=1}^{\infty} V_k \delta_{\phi_k}$, the feature allocation is drawn according to the
procedure outlined for feature frequency models conditioned on the underlying
random measure. Building on work of Thibaux and Jordan (2007) in the
case of the IBP, Teh and Görür (2009) demonstrate that the distribution of the
resulting feature allocation is the same as if it were generated according to a
three-parameter Indian buffet process. ■
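
A truncated version of this recursion can be sketched in code. The block below is illustrative only and rests on assumptions that should be checked against the cited references: it takes the round-$n$ Poisson rate to be $\gamma \, \Gamma(\theta+1)\Gamma(\theta+\alpha-1+n) / (\Gamma(\theta+n)\Gamma(\theta+\alpha))$ and the round-$n$ atom weights to be $\mathrm{Beta}(1-\alpha, \theta+n-1+\alpha)$, and it stops after a finite number of rounds, so it returns only part of the random measure. The function names are hypothetical.

```python
import math
import random


def atom_rate(n, gamma, theta, alpha):
    """Assumed Poisson rate for the number of atoms created in round n."""
    log_rate = (math.log(gamma)
                + math.lgamma(theta + 1) - math.lgamma(theta + n)
                + math.lgamma(theta + alpha - 1 + n) - math.lgamma(theta + alpha))
    return math.exp(log_rate)


def sample_poisson(rate, rng):
    """Knuth's multiplication method; adequate for small rates."""
    threshold = math.exp(-rate)
    count, running_product = 0, rng.random()
    while running_product > threshold:
        count += 1
        running_product *= rng.random()
    return count


def sample_beta_process_atoms(num_rounds, gamma, theta, alpha, seed=0):
    """Return (V_k, phi_k) pairs for the atoms created in rounds 1..num_rounds."""
    rng = random.Random(seed)
    atoms = []
    for n in range(1, num_rounds + 1):
        for _ in range(sample_poisson(atom_rate(n, gamma, theta, alpha), rng)):
            weight = rng.betavariate(1 - alpha, theta + n - 1 + alpha)
            atoms.append((weight, rng.random()))  # phi_k ~ Unif[0, 1]
    return atoms
```

Under these assumptions the expected total weight of the round-$n$ atoms is `atom_rate(n, ...) * (1 - alpha) / (theta + n)`, and these terms telescope so that their sum over all rounds equals $\gamma$, the expected number of features of a single data point.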



We have seen that the 3IBP can be represented as a feature frequency model.
It is straightforward to observe that the two-feature model in Examples 9 and
12 cannot be represented as a feature frequency model unless the intersection
of the feature subsets has length $p_{11}$ equal to the product of the feature subset
lengths ($p_{10} + p_{11}$ and $p_{01} + p_{11}$); i.e., unless $(p_{10} + p_{11})(p_{01} + p_{11}) = p_{11}$ (cf.
Figure 8). Therefore, we have the following result similar to Proposition 7.

Proposition 15. The class of feature frequency models is a strict but non-empty 
subclass of the class of exchangeable feature allocations. 

In proving Propositions 15 and 7, we used the 3IBP as an example that
belongs to both the class of feature models with EFPFs and the class of feature
frequency models. Moreover, in both cases we used two-feature models as an
example of exchangeable feature models that do not belong to these subclasses; in
particular, we used two-feature models in which the feature combination
probabilities $p_{10}, p_{01}, p_{11}, p_{00}$ are not in the necessary proportions. These observations
suggest that feature frequency models and EFPFs may be linked. We flesh out
the relationship between the two representations in the next few results.

We start with a priori labeled features. Recall from Section 3 that an a priori
labeled feature allocation is to a feature allocation what a classification is to a
clustering; that is, the feature labels are known in advance. The case where we
know the feature order in advance is somewhat easier and gives intuition for the
type of result we would like in the true feature allocation case. In particular, we
prove the results for the case of two a priori labeled features in Theorem 16 and
then the case of an unbounded number of a priori labeled features in Theorem 17.

From there, we move on to the (a priori) unlabeled case that is the focus of
the paper and prove the equivalence of EFPFs and a slight extension of feature
frequency models in Theorem 18.

Theorem 16. Consider a model with two a priori labeled features: feature 1 and
feature 2. If the two features are generated from labeled feature frequencies, the
probability of an a priori labeled feature allocation of $[N]$ with $M_{N,1}$ occurrences
of feature 1 and $M_{N,2}$ occurrences of feature 2 takes the form $p(N; M_{N,1}, M_{N,2})$,
where we make no symmetry assumptions about $p$ here and also allow any of
$M_{N,1}$ and $M_{N,2}$ to be zero. Conversely, if the probability of any a priori labeled
feature allocation can be written as $p(N; M_{N,1}, M_{N,2})$, then the feature allocation
has the same distribution as if it were generated from labeled feature frequencies.

Proof. Note that throughout this proof we consider the probability of a particular
labeled feature allocation of $[N]$ with $M_{N,1}$ occurrences of feature 1 and
$M_{N,2}$ occurrences of feature 2, as distinct from the probability of all labeled
feature allocations of $[N]$ with $M_{N,1}$ occurrences of feature 1 and $M_{N,2}$
occurrences of feature 2. The latter, which is not addressed here, would be the sum
over instances of the former. In particular, recalling the matrix representation
from Section 3, there are

\[ \binom{N}{M_{N,1}} \binom{N}{M_{N,2}} \]

possible $N \times 2$ matrices with $M_{N,1}$ ones in the first column and $M_{N,2}$ ones in
the second column.

The reader may feel there is some similarity in this setup to the two-feature
allocation of Examples 9 and 12. We note that the quantities $p_{10}, p_{01}, p_{11}, p_{00}$,
which retain essentially the same meaning as in Figure 8, may now be random
and that their order is pre-specified and non-random.

First, we calculate the probability of a certain labeled feature configuration
under this model. Let $M'_{n,10}$ be the number of indices in $[n]$ with feature 1 but
not feature 2. Let $M'_{n,01}$ be the number of indices in $[n]$ with feature 2 but not
feature 1. Let $M'_{n,00}$ count the indices with neither feature, and let $M'_{n,11}$ count
the indices with both features. Then

\[ P(F_{N,1} = f_{N,1}, F_{N,2} = f_{N,2}) = E\left[ p_{10}^{M'_{N,10}} \, p_{01}^{M'_{N,01}} \, p_{11}^{M'_{N,11}} \, p_{00}^{M'_{N,00}} \right]. \quad (9) \]

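Denote the total probabilities of features 1 and 2 as, respectively,
$q_1 = p_{10} + p_{11}$ and $q_2 = p_{01} + p_{11}$. Suppose that we have a feature frequency model.
This assumption implies that

\[ p_{10} \overset{a.s.}{=} q_1 (1 - q_2), \quad p_{01} \overset{a.s.}{=} (1 - q_1) q_2, \quad p_{11} \overset{a.s.}{=} q_1 q_2, \quad p_{00} \overset{a.s.}{=} (1 - q_1)(1 - q_2), \quad (10) \]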

where any one of the equalities in Eq. (10) implies the others. It follows that

\[ P(F_{N,1} = f_{N,1}, F_{N,2} = f_{N,2}) = E\left[ q_1^{M_{N,1}} (1 - q_1)^{N - M_{N,1}} \, q_2^{M_{N,2}} (1 - q_2)^{N - M_{N,2}} \right], \quad (11) \]

where $M_{n,1} = M'_{n,10} + M'_{n,11}$ is the total number of indices with feature 1, and
likewise $M_{n,2} = M'_{n,01} + M'_{n,11}$ is the total number of indices with feature 2.

So we see that making a feature frequency model assumption yields a feature
allocation probability in Eq. (11) that depends only on $N$, $M_{N,1}$, $M_{N,2}$. Since
we retain the known labeling in this example, the probability is not symmetric
in $M_{N,1}$ and $M_{N,2}$.

In the other direction, suppose we know that

\[ P(F_{N,1} = f_{N,1}, F_{N,2} = f_{N,2}) = p(N, M_{N,1}, M_{N,2}) \quad (12) \]

for some function $p$. Again, we make no symmetry assumptions about $p$ here,
and any of $M_{N,1}$ and $M_{N,2}$ may be zero. Then frequencies $p_{10}, p_{01}, p_{11}, p_{00}$ must
exist by the law of large numbers; we note they may be random.
The assumption in Eq. (12) implies that the configurations

\[ (M'_{4,10}, M'_{4,01}, M'_{4,00}, M'_{4,11}) = (2, 2, 0, 0) \]
\[ (M'_{4,10}, M'_{4,01}, M'_{4,00}, M'_{4,11}) = (0, 0, 2, 2) \]
\[ (M'_{4,10}, M'_{4,01}, M'_{4,00}, M'_{4,11}) = (1, 1, 1, 1) \]

have the same probability, since each has $N = 4$ and $M_{4,1} = M_{4,2} = 2$. That is, by Eq. (9),

\[ E[p_{10}^2 p_{01}^2] = E[p_{11}^2 p_{00}^2] = E[p_{10} p_{01} p_{11} p_{00}]. \]

It follows that

\[ E[(p_{10} p_{01} - p_{11} p_{00})^2] = E[p_{10}^2 p_{01}^2 + p_{11}^2 p_{00}^2 - 2 p_{10} p_{01} p_{11} p_{00}] = 0. \]

So it must be that $p_{10} p_{01} \overset{a.s.}{=} p_{11} p_{00}$. Recall that this condition is familiar from
Example 9.

Adding $p_{10} p_{11}$ to both sides of the almost sure equality and then further
adding $p_{11}(p_{01} + p_{11})$ to both sides yields

\[ (p_{10} + p_{11})(p_{01} + p_{11}) \overset{a.s.}{=} p_{11}(p_{10} + p_{01} + p_{11} + p_{00}), \]

which reduces to

\[ q_1 q_2 \overset{a.s.}{=} p_{11} \]

from the definitions of $q_1$ and $q_2$ and from the fact that $p_{10} + p_{01} + p_{11} + p_{00} = 1$.

By Eq. (11) and surrounding text, we see that Eq. (12) implies our model is
a feature frequency model. Thus, the equivalence between models with a priori
labeled EFPFs and a priori labeled feature frequency models in the case of two
features results from simple algebraic manipulations. □
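
The role of the three $N = 4$ configurations can be illustrated numerically. The sketch below assumes deterministic (non-random) frequencies, so the expectation in the configuration probability is trivial; the function names and the particular frequency values are hypothetical choices for illustration.

```python
from fractions import Fraction


def configuration_probability(counts, p10, p01, p00, p11):
    """Probability of one labeled configuration with the given index counts.

    counts = (M'_10, M'_01, M'_00, M'_11), matching the ordering used in the proof.
    """
    m10, m01, m00, m11 = counts
    return p10 ** m10 * p01 ** m01 * p00 ** m00 * p11 ** m11


def product_form(q1, q2):
    """Frequencies (p10, p01, p00, p11) implied by independent features."""
    return q1 * (1 - q2), (1 - q1) * q2, (1 - q1) * (1 - q2), q1 * q2


# The three configurations share N = 4 and M_1 = M_2 = 2.
CONFIGURATIONS = [(2, 2, 0, 0), (0, 0, 2, 2), (1, 1, 1, 1)]
```

With `product_form(Fraction(1, 3), Fraction(1, 4))` all three configurations have probability $q_1^2 (1 - q_1)^2 q_2^2 (1 - q_2)^2 = 1/576$. With frequencies that violate $p_{10} p_{01} = p_{11} p_{00}$, such as $(p_{10}, p_{01}, p_{00}, p_{11}) = (1/2, 1/4, 1/8, 1/8)$, the three probabilities differ, so no function of $(N, M_{N,1}, M_{N,2})$ alone can describe them.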

Extending the argument above becomes more tedious when more than two 
features are involved. In the case of multiple, or even countably many, labeled 
features, a more elegant proof exists. 

Theorem 17. Consider a model with features a priori labeled $1, 2, 3, \ldots$. If the
features are generated from labeled feature frequencies, the probability of an a
priori labeled feature allocation of $[N]$ with $K$ or fewer features and $M_{N,k}$
occurrences of feature $k$ for $k \in \{1, \ldots, K\}$ takes the form $p(N; M_{N,1}, \ldots, M_{N,K})$,
where we make no symmetry assumptions about $p$ here and note that any of
$M_{N,1}, \ldots, M_{N,K}$ may be zero. Call $p$ a labeled EFPF. Conversely, if the
probability of any a priori labeled feature allocation can be written as $p(N; M_{N,1}, \ldots, M_{N,K})$,
then the feature allocation has the same distribution as if it were generated from
labeled feature frequencies.

Proof. First, consider the claim that every labeled feature frequency model has
a labeled EFPF. This claim is intuitively clear since the independent Bernoulli
draws at each atom of the (potentially random) measure $B = \sum_{k=1}^{\infty} V_k \delta_{\phi_k}$
result in a probability that depends only on the number of occurrences of the
corresponding feature and not any interactions between features.

To show this direction formally, we consider a fixed, labeled feature allocation
$f_N = (A_{N,1}, A_{N,2}, \ldots, A_{N,K})$ with $M_{N,k} := |A_{N,k}|$ and note that

\[ P(F_N = f_N) = E\left[ P(F_N = f_N \mid B) \right] = E\left[ \left( \prod_{k=1}^{K} V_k^{M_{N,k}} (1 - V_k)^{N - M_{N,k}} \right) \cdot \prod_{k=K+1}^{\infty} (1 - V_k)^{N} \right]. \]

It follows that $P(F_N = f_N)$ has $p$ form.

Now consider the other direction. We start with a labeled feature allocation
$F_\infty$. In this case, we know that for every labeled feature allocation of $[N]$,

\[ f_N = (A_{N,1}, \ldots, A_{N,K}), \]

we have that a function $p$ exists in the form

\[ P(F_N = f_N) = p(N, |A_{N,1}|, \ldots, |A_{N,K}|), \quad (13) \]

with no additional symmetry assumptions for $p$ and where the block sizes
$M_{N,k} = |A_{N,k}|$ may be zero.

Let $Z_{n,k}$ be one if $n$ belongs to the $k$th feature (i.e., $n \in A_{N,k}$) or zero
otherwise. Let $b_1, \ldots, b_K$ be values in $\{0, 1\}$. Our goal is to show that conditional
on some (as yet unknown) labeled feature frequencies, the probability of feature
presence factorizes as independent Bernoulli draws:

\[ P(Z_{1,1} = b_1, \ldots, Z_{1,K} = b_K \mid V_1, \ldots, V_K) = \prod_{k=1}^{K} V_k^{b_k} (1 - V_k)^{1 - b_k}. \quad (14) \]

By the assumption on $p$, the labeled feature sizes $M_{N,1}, \ldots, M_{N,K}$ are
sufficient for the distribution of the labeled feature allocation. So we start by
considering

\[ P(Z_{1,1} = b_1, \ldots, Z_{1,K} = b_K \mid M_{N,1}, \ldots, M_{N,K}) = \prod_{k=1}^{K} P(Z_{1,k} = b_k \mid Z_{1,1} = b_1, \ldots, Z_{1,k-1} = b_{k-1}, M_{N,1}, \ldots, M_{N,K}). \quad (15) \]

Let $\mathcal{E}_N$ be the sigma-field of events invariant under permutations of the first $N$
indices. Then again since the feature sizes are sufficient for the feature allocation
distribution, we have

\[ P(Z_{1,k} = b_k \mid Z_{1,1} = b_1, \ldots, Z_{1,k-1} = b_{k-1}, M_{N,1}, \ldots, M_{N,K}) \]
\[ = P(Z_{1,k} = b_k \mid Z_{1,1} = b_1, \ldots, Z_{1,k-1} = b_{k-1}, \mathcal{E}_N) \quad (16) \]
\[ = P(Z_{1,k} = b_k \mid \mathcal{E}_N) \]
\[ = \frac{1}{N} \sum_{n=1}^{N} P(Z_{n,k} = b_k \mid \mathcal{E}_N) \]
\[ = E\left[ \frac{1}{N} \sum_{n=1}^{N} 1\{Z_{n,k} = b_k\} \,\Big|\, \mathcal{E}_N \right] \]
\[ = \frac{1}{N} \sum_{n=1}^{N} 1\{Z_{n,k} = b_k\}. \]

The last line follows since the sum is measurable in $\mathcal{E}_N$. By the strong law
of large numbers, the final sum converges almost surely as $N \to \infty$ to some
potentially random value in $[0, 1]$; call it $V_k$ if $b_k = 1$. By Eq. (15), then, we
have

\[ P(Z_{1,1} = b_1, \ldots, Z_{1,K} = b_K \mid M_{N,1}, \ldots, M_{N,K}) \overset{a.s.}{\to} \prod_{k=1}^{K} V_k^{b_k} (1 - V_k)^{1 - b_k}. \quad (17) \]


On the other hand, Eqs. (16) and (15) imply that

\[ P(Z_{1,1} = b_1, \ldots, Z_{1,K} = b_K \mid M_{N,1}, \ldots, M_{N,K}) = P(Z_{1,1} = b_1, \ldots, Z_{1,K} = b_K \mid \mathcal{E}_N). \]

We next observe that the righthand side of the above equality is a reverse
martingale. $(\mathcal{E}_N)$ is a reversed filtration since $\mathcal{E}_N \supseteq \mathcal{E}_{N+1}$ for all $N$. Moreover,
(1) $P(Z_{1,1} = b_1, \ldots, Z_{1,K} = b_K \mid \mathcal{E}_N)$ is measurable with respect to $\mathcal{E}_N$; (2) the
same quantity is integrable; and (3) by the tower law,

\[ E\left[ P(Z_{1,1} = b_1, \ldots, Z_{1,K} = b_K \mid \mathcal{E}_N) \mid \mathcal{E}_{N+1} \right] = P(Z_{1,1} = b_1, \ldots, Z_{1,K} = b_K \mid \mathcal{E}_{N+1}). \]

Since $P(Z_{1,1} = b_1, \ldots, Z_{1,K} = b_K \mid \mathcal{E}_N)$ is a reverse martingale, we have that

\[ P(Z_{1,1} = b_1, \ldots, Z_{1,K} = b_K \mid \mathcal{E}_N) \overset{a.s.}{\to} P(Z_{1,1} = b_1, \ldots, Z_{1,K} = b_K \mid \mathcal{E}_\infty) \]

for $\mathcal{E}_\infty = \bigcap_{N=1}^{\infty} \mathcal{E}_N$ by reverse martingale convergence. Together with Eq. (17),
this convergence implies that

\[ P(Z_{1,1} = b_1, \ldots, Z_{1,K} = b_K \mid \mathcal{E}_\infty) = \prod_{k=1}^{K} V_k^{b_k} (1 - V_k)^{1 - b_k}, \]

and since the $V_k$ are measurable with respect to $\mathcal{E}_\infty$, the tower law yields
Eq. (14), as was to be shown. □

While illustrative, the two previous results do not directly deal with feature
allocations as defined earlier in this paper; namely, they do not show any
equivalence between EFPFs and feature frequency models in the case where the
features are unlabeled (which is exactly the case where EFPFs are defined). We
will show in the unlabeled case that every feature frequency model has an EFPF
and that every regular feature allocation with an EFPF is a feature frequency
model. In fact, we can consider a general (i.e., not necessarily regular)
feature allocation and characterize the EFPF representation in this case.

Theorem 18. Let $\lambda$ be a non-negative random variable (which may have
some arbitrary joint law with the feature frequencies in a feature frequency
model). We can obtain an exchangeable feature allocation by generating a
feature allocation from a feature frequency model and then, for each index $n$,
including an independent Poisson($\lambda$)-distributed number of features of
the form $\{n\}$ in addition to those features previously generated (which may
also include index $n$). A feature allocation of this type has an EFPF.
Conversely, every feature allocation with an EFPF has the same distribution as
one generated by this construction for some joint distribution of $\lambda$ and
the feature frequencies.






Proof. Suppose a feature allocation $f$ is generated as described by the
construction in Theorem 18 with (potentially random) measure
$B = \sum_{k=1}^{\infty} V_k \delta_{\phi_k}$ giving the frequencies in the
feature frequency model component. We wish to show that the feature allocation
has an EFPF. We will make use of the fact that an equivalent way to generate
the Poisson component of the feature allocation is to draw
Poisson($N\lambda$) singletons and then assign each uniformly at random to an
index in $[N]$.
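
The equivalence of the two ways of generating the Poisson component can be checked numerically. The following is a minimal sketch, not part of the original argument: it compares the joint probability of a vector of per-index singleton counts under (i) independent Poisson($\lambda$) draws per index and (ii) a Poisson($N\lambda$) total assigned uniformly at random to indices. The function names and the values of `lam` and `n` are hypothetical choices made for illustration.

```python
from itertools import product
from math import exp, factorial, prod


def poisson_pmf(m, rate):
    """Probability that a Poisson(rate) variable equals m."""
    return exp(-rate) * rate**m / factorial(m)


def pmf_independent(counts, lam):
    """Joint pmf when each index draws its own Poisson(lam) singleton count."""
    return prod(poisson_pmf(c, lam) for c in counts)


def pmf_total_then_assign(counts, lam):
    """Joint pmf when Poisson(N * lam) singletons are drawn and each one is
    assigned to one of the N indices uniformly at random."""
    n = len(counts)
    m = sum(counts)
    # Multinomial coefficient m! / (c_1! ... c_N!), computed with integers.
    ways = factorial(m)
    for c in counts:
        ways //= factorial(c)
    return poisson_pmf(m, n * lam) * ways / n**m


# Hypothetical illustration values (assumptions, not from the paper).
lam, n = 0.7, 3
max_gap = max(
    abs(pmf_independent(c, lam) - pmf_total_then_assign(c, lam))
    for c in product(range(5), repeat=n)
)
print(max_gap)
```

The two probabilities agree up to floating-point rounding for every count vector in the grid, which is the standard superposition property of independent Poisson variables that the proof relies on.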

Consider $f_N = (A_1, A_2, \ldots, A_K)$. Let $S = \{k : |A_k| = 1\}$ represent
the feature indices of the singletons of the feature allocation. These features
may have been generated either from the feature frequency model or from the
Poisson component. To find the probability of the feature allocation, we
consider each possible association of singletons to one of these components.
For any such association, let $\tilde{S}$ represent those singletons assigned
to the Poisson component; that is, $\tilde{S} \subseteq S$. Let
$\tilde{K} = K - |\tilde{S}|$ represent the number of remaining features, which
we denote by

\[
(\tilde{A}_1, \ldots, \tilde{A}_{\tilde{K}}).
\]

Then the probability of this feature allocation satisfies

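\begin{align*}
\mathbb{P}(F_N = f_N)
&= \mathbb{E}\left[ \mathbb{P}(F_N = f_N \mid B, \lambda) \right] \\
&= \mathbb{E}\Bigg[ \frac{1}{K!} \sum_{\tilde{S} : \tilde{S} \subseteq S} \;
   \sum_{\substack{(i_1, \ldots, i_{\tilde{K}}) \\ \text{distinct}}}
   N^{-|\tilde{S}|} \, \mathrm{Poisson}\big(|\tilde{S}| \,\big|\, N\lambda\big) \\
&\qquad\quad \cdot V_{i_1}^{|\tilde{A}_1|} (1 - V_{i_1})^{N - |\tilde{A}_1|} \cdots
   V_{i_{\tilde{K}}}^{|\tilde{A}_{\tilde{K}}|} (1 - V_{i_{\tilde{K}}})^{N - |\tilde{A}_{\tilde{K}}|}
   \prod_{l \notin \{i_1, \ldots, i_{\tilde{K}}\}} (1 - V_l)^{N} \Bigg].
\end{align*}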

The final expression depends only on the number of data points $N$ and feature
sizes and is symmetric in the feature sizes. So it has EFPF form.

In the other direction, we sidestep the issue of feature ordering by looking 
at the number of features to which each data index belongs. The advantage of 
this approach is that this number does not depend on the feature order. The 
following result is the key to making use of this observation. 

Lemma 19. Let $K_n$ be a sequence of positive integers. For each $n$, suppose
we have (constants)

\[
1 \ge p_{n,1} \ge p_{n,2} \ge \cdots \ge p_{n,K_n} \ge 0.
\]

And, for completeness, suppose $p_{n,k} = 0$ for $k > K_n$. Let
$X_{n,k} \sim \mathrm{Bern}(p_{n,k})$, independently across $n$ and $k$ and
with $k = 1, \ldots, K_n$. Define $\#_n := \sum_{k=1}^{K_n} X_{n,k}$.
Then the following are equivalent.

1. $\#_n \xrightarrow{d} \#$ for some finite-valued random variable $\#$ on
   $\{0, 1, 2, \ldots\}$.

2. There exist (constants) $\{p_k\}_{k=1}^{\infty}$ and $\lambda$ such that
   $p_k \in [0, 1]$ and $\lambda \ge 0$ and further such that, for all
   $k = 1, 2, \ldots$,

\[
p_{n,k} \to p_k, \quad n \to \infty \tag{18}
\]

   and

\[
\sum_{k=1}^{K_n} p_{n,k} \to \sum_{k=1}^{\infty} p_k + \lambda, \quad n \to \infty. \tag{19}
\]

In this case, we further have

\[
1 \ge p_1 \ge p_2 \ge \cdots \tag{20}
\]

and

\[
\# \overset{d}{=} Y + \sum_{k=1}^{\infty} X_k, \tag{21}
\]

where $X_k \sim \mathrm{Bern}(p_k)$, independently across $k$, and
$Y \sim \mathrm{Poisson}(\lambda)$.

The proof of Lemma 19 appears in Appendix B; this lemma is essentially a
special case of a more general result in Appendix A.
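
To make the statement of the lemma concrete, the following sketch works through one hypothetical triangular array (our own choice, not taken from the paper): row $n$ has one large probability $p_{n,1} = 0.5$ followed by $n$ small probabilities equal to $\lambda/n$ with $\lambda = 1.5$. The atom sizes converge to $p_1 = 0.5$ and $p_k = 0$ for $k \ge 2$, while the row sums stay at $0.5 + \lambda$, so the limit should be a Bernoulli(0.5) variable plus an independent Poisson($\lambda$) variable. The code computes the exact row distribution by convolution and measures its distance to that limit.

```python
from math import exp, factorial


def poisson_binomial_pmf(ps, max_count):
    """Exact pmf on {0, ..., max_count} of a sum of independent Bernoulli(p)
    variables, built by convolving in one Bernoulli at a time."""
    pmf = [1.0] + [0.0] * max_count
    for p in ps:
        pmf = [
            pmf[j] * (1 - p) + (pmf[j - 1] * p if j > 0 else 0.0)
            for j in range(max_count + 1)
        ]
    return pmf


def limit_pmf(p1, lam, max_count):
    """pmf of Bernoulli(p1) + Poisson(lam), the limit suggested by the lemma."""
    pois = [exp(-lam) * lam**j / factorial(j) for j in range(max_count + 1)]
    return [
        (1 - p1) * pois[j] + (p1 * pois[j - 1] if j > 0 else 0.0)
        for j in range(max_count + 1)
    ]


def l1_gap(n, p1=0.5, lam=1.5, max_count=30):
    """Sum of absolute pmf differences between row n and the limit."""
    row = [p1] + [lam / n] * n  # hypothetical row of the triangular array
    exact = poisson_binomial_pmf(row, max_count)
    target = limit_pmf(p1, lam, max_count)
    return sum(abs(a - b) for a, b in zip(exact, target))


gaps = [l1_gap(n) for n in (10, 100, 1000)]
print(gaps)
```

The gaps shrink as $n$ grows. The mass that "escapes" from the individual atoms into many vanishing probabilities is exactly what the Poisson term with parameter $\lambda$ accounts for in Eq. (19).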

In this direction of the proof of Theorem 18, we want to show that if we
assume that the probability of a feature allocation takes EFPF form, then the
allocation has the same distribution as if it were generated according to a
feature frequency model with a Poisson-distributed number of singleton features
for each $n$. To see how Lemma 19 may be useful, we let $\#$ be the number of
features in which index 1 occurs. Recall that in order to use the EFPF, we
apply a uniform random ordering to the features of our feature allocation.
Examining $\#$ is advantageous since it is invariant to the ordering of the
features, and we can thereby avoid complicated considerations that may arise
related to the feature ordering and consistency of ordering across feature
allocations of increasing index sets.

Indeed, recall that once we have chosen a uniform random ordering for the
features, the EFPF assumption tells us that any feature allocation with the
requisite feature sizes and number of indices has the same probability. Let
$K_N$ be the number of features containing indices in $[N]$. If $M_{N,k}$ is
the size of the $k$th feature (under the uniform random ordering) after $N$
indices, then there are

\[
\prod_{k=1}^{K_N} \binom{N}{M_{N,k}}
\]

such configurations. A fraction $M_{N,1}/N$ of these have index 1 in the first
feature. For each such allocation, there are equally many configurations of the
remaining features. So, for each such allocation, a fraction $M_{N,2}/N$ have
index 1 in the second feature. And so on. That is, we have that, conditionally
on the feature sizes, the number of features with index 1 has the same
distribution as a sum of independent Bernoulli random variables:

\[
\sum_{k=1}^{K_N} \mathrm{Bern}\left( \frac{M_{N,k}}{N} \right). \tag{22}
\]

First, we note that the feature sizes are sufficient for the distribution by
the EFPF assumption. So we may, in fact, condition on $\mathcal{E}_N$, which we
define to be the sigma-field of events invariant under permutations of the
indices $n = 1, \ldots, N$. That is, $\# \mid \mathcal{E}_N$ has the same
distribution as the sum in Eq. (22).
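
The counting step behind the Bernoulli representation is that, among all subsets of $[N]$ of a given size $M$, a fraction $M/N$ contain index 1. The sketch below verifies this by brute-force enumeration for small $N$; the function name and the choice $N = 6$ are ours, purely for illustration.

```python
from fractions import Fraction
from itertools import combinations


def fraction_containing_index_one(n, m):
    """Fraction of the size-m subsets of {1, ..., n} that contain index 1."""
    subsets = list(combinations(range(1, n + 1), m))
    hits = sum(1 for subset in subsets if 1 in subset)
    return Fraction(hits, len(subsets))


# Hypothetical small case: every feature size m from 1 to n.
n = 6
fractions_by_size = {m: fraction_containing_index_one(n, m) for m in range(1, n + 1)}
print(fractions_by_size)
```

Because the features are placed independently once their sizes are fixed, the indicators "index 1 belongs to feature $k$" are independent Bernoulli variables with success probabilities $M_{N,k}/N$.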

Second, we note that the sum in Eq. (22) has no dependence on the ordering
of the features. In particular, then, let
$1 \ge p_{N,1} \ge p_{N,2} \ge \cdots \ge p_{N,K_N}$ be the sizes of the
features divided by $N$ and ordered so as to be monotonically decreasing.
Again, note that we are only considering those features including some data
index in $[N]$. It follows that

\[
\# \mid \mathcal{E}_N \overset{d}{=} \sum_{k=1}^{K_N} X_{N,k}, \qquad
X_{N,k} \overset{\text{indep}}{\sim} \mathrm{Bern}(p_{N,k}). \tag{23}
\]

So we see that we have circumvented ordering concerns and can simply use a
size ordering in what follows.

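At this point, it seems natural to apply Lemma 19 to $\# \mid \mathcal{E}_N$.
To do so, we need to show that $\# \mid \mathcal{E}_N$ converges in
distribution to some random variable with non-negative integer values as
$N \to \infty$. To that end, we note that $(\mathcal{E}_N)$ is a reversed
filtration: $\mathcal{E}_N \supseteq \mathcal{E}_{N+1}$ for all $N$. And
further $\mathbb{P}(\# = j \mid \mathcal{E}_N)$ is a reversed martingale since
(1) $\mathbb{P}(\# = j \mid \mathcal{E}_N)$ is measurable with respect to
$\mathcal{E}_N$; (2) $\mathbb{P}(\# = j \mid \mathcal{E}_N)$ is integrable; and
(3) by the tower law,
$\mathbb{E}[\mathbb{P}(\# = j \mid \mathcal{E}_N) \mid \mathcal{E}_{N+1}] = \mathbb{P}(\# = j \mid \mathcal{E}_{N+1})$.
It follows that

\[
\mathbb{P}(\# = j \mid \mathcal{E}_N) \to \mathbb{P}(\# = j \mid \mathcal{E}_\infty) \quad \text{a.s.}
\]

and hence

\[
\# \mid \mathcal{E}_N \xrightarrow{d} \# \mid \mathcal{E}_\infty \quad \text{a.s.}
\]

for $\mathcal{E}_\infty = \bigcap_{N=1}^{\infty} \mathcal{E}_N$ by reverse
martingale convergence.

So we may apply Lemma 19 conditional on $\mathcal{E}_\infty$. By the lemma, we
have that, conditional on $\mathcal{E}_\infty$,

\begin{align*}
\# &\overset{d}{=} Y + \sum_{k=1}^{\infty} X_k, \\
Y &\sim \mathrm{Poisson}(\lambda), \\
X_k &\overset{\text{indep}}{\sim} \mathrm{Bern}(p_k)
\end{align*}

for some $\lambda \ge 0$ and some $1 \ge p_1 \ge p_2 \ge \cdots$. The
conditioning on $\mathcal{E}_\infty$ means that, in general, $\lambda$ and the
frequencies $1 \ge p_1 \ge p_2 \ge \cdots$ may be positive random variables, as
was to be shown. □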

7 Conclusion 

It has been known for some time that the class of exchangeable partitions is
the same as the class of partitions generated by the Kingman paintbox, which
is in turn the same as the class of partitions with exchangeable partition
probability functions (EPPFs). In this paper, we have developed an analogous
set of concepts for the feature allocation problem. We defined a feature
allocation as an extension of partitions in which indices may belong to
multiple groups, now called features. We have developed analogues of the EPPF
and the Kingman paintbox, which we refer to as the exchangeable feature
probability function (EFPF) and the feature paintbox, respectively. The feature
paintbox allows us to construct a feature allocation via iid draws from an
underlying collection of sets in the unit interval. In the special cases of
partitions and feature frequency models, the construction of these sets is
particularly straightforward.

The Venn diagram presented in an earlier figure summarizes our results and
also suggests a number of open areas for further investigation. In particular,
it would be useful to develop a fuller understanding of the regularity
condition on feature allocations that allows the connection to the feature
paintbox. It would also be of interest to carry the program further by
exploring generalizations of the partition and feature allocation framework to
other combinatorial representations, such as the setting in which we allow
multiplicity within, as well as across, features (Broderick et al., 2011;
Zhou et al., 2012).

A Intermediate lemmas leading to Lemma 19



To prove Lemma 19, we will make use of a few definitions and lemmas. We start
with two definitions. First, suppose we have constants $p_1, p_2, p_3, \ldots$
such that

\[
1 \ge p_1 \ge p_2 \ge p_3 \ge \cdots \ge 0
\]

and a constant $\lambda$ such that $0 \le \lambda < \infty$. Then we say that
the random variable $\#$ has the extended Poisson-binomial distribution with
parameters $(\lambda, p_1, p_2, \ldots)$ if there exist independent random
variables $X_0, X_1, X_2, \ldots$ with

\begin{align*}
X_0 &\sim \mathrm{Poisson}(\lambda) \\
X_k &\sim \mathrm{Bern}(p_k), \quad k = 1, 2, \ldots
\end{align*}

such that

\[
\# \overset{d}{=} X_0 + \sum_{k=1}^{\infty} X_k.
\]

Second, we say that $\mu$ is the spike size-location measure with parameters
$(\lambda, p_1, p_2, \ldots)$ if $\mu$ puts mass $\lambda$ at $0$ and mass
$p_k$ at $p_k$ for $k = 1, 2, \ldots$. With these definitions in hand, we can
state the following lemmas.

Lemma 20. Let $\#$ have the extended Poisson-binomial distribution with
parameters $(\lambda, p_1, p_2, \ldots)$. Then

1. $\#$ is a.s. finite if and only if $\sum_{k=1}^{\infty} p_k < \infty$.

2. If $\#$ is a.s. finite, then the parameters $(\lambda, p_1, p_2, \ldots)$
   are uniquely determined by the distribution of $\#$.

In particular, since the parameters $(\lambda, p_1, p_2, \ldots)$ uniquely
determine the distribution of $\#$, Lemma 20 tells us that there is a bijection
between the distribution of $\#$ and the parameters
$(\lambda, p_1, p_2, \ldots)$ when $\#$ is a.s. finite. See Appendix C for the
proof of Lemma 20.

The next lemma tells us that this correspondence between distributions and 
parameters is also continuous in a sense. 

Lemma 21. For $n = 1, 2, \ldots$, let $\#_n$ have the extended Poisson-binomial
distribution with parameters $(\lambda_n, p_{n,1}, p_{n,2}, \ldots)$. Let
$\mu_n$ be the spike size-location measure with parameters
$(\lambda_n, p_{n,1}, p_{n,2}, \ldots)$.

Then the following two statements are equivalent:

1. $\#_n$ converges in distribution to a finite-valued limit random variable.

2. $\mu_n$ converges weakly to some finite measure on $[0, 1]$.

If the convergence holds, the limiting random variable (call it $\#$) has an
extended Poisson-binomial distribution, and the limiting measure (call it
$\mu$) is a spike size-location measure. In this case, $\#$ and $\mu$ have the
same parameters; call the parameters $(\lambda, p_1, p_2, \ldots)$.

This lemma is suggested by, and provides an extension to, previous results on
triangular arrays of random variables with row sums converging in distribution;
cf. Kallenberg (2002). See Appendix D for the proof of Lemma 21.

Lemma 19 highlights a special case of Lemmas 20 and 21 that we use to
prove the equivalence in Theorem 18.

B Proof of Lemma 19



We can rephrase the statement of Lemma 19 in terms of the terminology
introduced in Appendix A. In particular, we are given a sequence of random
variables $\#_n$, where $\#_n$ has an extended Poisson-binomial distribution
with parameters $(0, p_{n,1}, p_{n,2}, \ldots, p_{n,K_n}, 0, 0, \ldots)$. Then
we see that Lemma 19 is essentially a special case of Lemma 21 where
$\lambda_n$ and all but finitely many of the $p_{n,k}$ are equal to zero.
Indeed, the extended Poisson-binomial distribution in exactly this special case
is known as the Poisson-binomial distribution (Wang, 1993; Chen and Liu, 1997).



(1) $\Rightarrow$ (2). We assume that $\#_n$ converges in distribution to some
finite-valued random variable $\#$, and we wish to show that the $p_{n,k}$
converge to some limiting $p_k$ as $n \to \infty$ for each $k$, and likewise
that $\sum_{k=1}^{K_n} p_{n,k}$ converges to
$\sum_{k=1}^{\infty} p_k + \lambda$ for some non-negative constant $\lambda$.
The $p_{n,k}$ are just the ordered atom sizes of the spike size-location
measures $\mu_n$ in Lemma 21. By Lemma 21, the $\mu_n$ converge weakly to some
spike size-location measure $\mu$. This convergence yields both the desired
convergence of the atom sizes (Eq. (18), repeated here)

\[
p_{n,k} \to p_k, \quad n \to \infty
\]

and the desired convergence of the total mass of $\mu_n$ (Eq. (19), repeated
here)

\[
\sum_{k=1}^{K_n} p_{n,k} \to \sum_{k=1}^{\infty} p_k + \lambda, \quad n \to \infty.
\]

(2) $\Rightarrow$ (1). Now we assume that the $p_{n,k}$ converge to some
limiting $p_k$ as $n \to \infty$ for each $k$, and likewise that
$\sum_{k=1}^{K_n} p_{n,k}$ converges to $\sum_{k=1}^{\infty} p_k + \lambda$ for
some appropriate positive constants $\{p_k\}$, $\lambda$. We wish to show that
$\#_n$ converges in distribution to some finite-valued random variable $\#$.

The assumed convergences guarantee the weak convergence of the spike
size-location measures $\mu_n$ to some finite measure on $[0, 1]$. Lemma 21
then guarantees that $\#_n$ converges in distribution to some finite-valued
random variable.

Assume (1) and (2). We wish to show that $1 \ge p_1 \ge p_2 \ge \cdots$
(Eq. (20)), but this result follows from the monotonicity of the $p_{n,k}$.

Eq. (21) in the original lemma statement can be rephrased as wanting to
show that $\#$ has the extended Poisson-binomial distribution with parameters
$(\lambda, p_1, p_2, \ldots)$. This follows directly from the final statement
in Lemma 21 and our identification of the limiting spike size-location measure
$\mu$ as having parameters $(\lambda, p_1, p_2, \ldots)$ in a previous part of
this proof ("(1) $\Rightarrow$ (2)"). □



C Proof of Lemma 20

Throughout we assume that $\#$ has the extended Poisson-binomial distribution
with parameters $(\lambda, p_1, p_2, \ldots)$.

(1). We want to show that $\#$ is a.s. finite if and only if
$\sum_{k=1}^{\infty} p_k < \infty$. Since $\#$ is extended Poisson-binomially
distributed, we can write $\# = X_0 + \sum_{k=1}^{\infty} X_k$ for independent
$X_0 \sim \mathrm{Poisson}(\lambda)$ and $X_k \sim \mathrm{Bern}(p_k)$ for
$k = 1, 2, \ldots$. First suppose $\sum_{k=1}^{\infty} p_k < \infty$. Then
$\sum_{k=1}^{\infty} X_k$ is a.s. finite by the Borel-Cantelli lemma. Second,
suppose $\sum_{k=1}^{\infty} p_k = \infty$. Then $\sum_{k=1}^{\infty} X_k$ is
a.s. infinite by the second Borel-Cantelli lemma. Since $X_0$ is a.s. finite by
construction, the result follows.

(2). We want to show that if $\#$ is a.s. finite, then the parameters
$(\lambda, p_1, p_2, \ldots)$ are uniquely determined by the distribution of
$\#$. To that end, let $\mu$ be the spike size-location measure with parameters
$(\lambda, p_1, p_2, \ldots)$. Note that $\mu$ need not be a probability
measure but is finite by the assumption that $\#$ is a.s. finite together with
part (1) of the lemma.

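To better understand the distribution of $\#$, we write the probability
generating function of $\#$. For $s$ with $0 \le s \le 1$, we have

\[
\mathbb{E} s^{\#} = e^{-\lambda(1-s)} \prod_{k=1}^{\infty} \left[ 1 - (1-s) p_k \right],
\]

which implies that for $s$ with $0 < s \le 1$ we have

\begin{align*}
-\log \mathbb{E} s^{\#}
&= \lambda(1-s) - \sum_{k=1}^{\infty} \log\left[ 1 - (1-s) p_k \right] \tag{24} \\
&= \lambda(1-s) + \sum_{k=1}^{\infty} \sum_{j=1}^{\infty} \frac{1}{j} (1-s)^j p_k^j \\
&\qquad \text{from the Taylor series expansion of the logarithm} \\
&= \lambda(1-s) + \sum_{j=1}^{\infty} \frac{1}{j} (1-s)^j \sum_{k=1}^{\infty} p_k^j \\
&\qquad \text{interchanging the order of summation since the summands are non-negative} \\
&= (1-s)\, \mu\{0\} + \sum_{j=1}^{\infty} \frac{1}{j} (1-s)^j \int_{(0,1]} x^{j-1} \, \mu(dx) \tag{25} \\
&= \sum_{j=1}^{\infty} \frac{1}{j} (1-s)^j m_{j-1}, \tag{26}
\end{align*}

where

\[
m_j := \int_{[0,1]} x^j \, \mu(dx)
\]

is the $j$th moment of the measure $\mu$.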
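
As a sanity check on the chain of equalities, the following sketch compares the direct form of the negative log generating function with its moment-series form for one hypothetical parameter choice with finitely many nonzero $p_k$. It assumes, as in the definition of the spike size-location measure, that $\mu$ puts mass $\lambda$ at $0$ and mass $p_k$ at location $p_k$, so that the $i$th moment is $\lambda \cdot 0^i + \sum_k p_k^{i+1}$. The function names, the parameter values, and the truncation at 200 series terms are our own illustrative choices.

```python
from math import log


def moment(i, lam, ps):
    """i-th moment of the measure with mass lam at 0 and mass p at location p."""
    atom_at_zero = lam if i == 0 else 0.0
    return atom_at_zero + sum(p * p**i for p in ps)


def neg_log_pgf_direct(s, lam, ps):
    """lam * (1 - s) minus the sum of log(1 - (1 - s) * p) over the atoms."""
    return lam * (1 - s) - sum(log(1 - (1 - s) * p) for p in ps)


def neg_log_pgf_series(s, lam, ps, terms=200):
    """Truncated series: sum over j of (1 - s)^j / j times the (j-1)-th moment."""
    return sum(
        (1 - s) ** j / j * moment(j - 1, lam, ps) for j in range(1, terms + 1)
    )


# Hypothetical parameters (assumptions for illustration only).
lam, ps, s = 0.8, [0.6, 0.3, 0.1], 0.4
direct = neg_log_pgf_direct(s, lam, ps)
series = neg_log_pgf_series(s, lam, ps)
print(direct, series)
```

The two values agree to floating-point accuracy here because $(1-s)p_k$ is well below 1, so the truncated Taylor series has a negligible tail.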

Now the distribution of $\#$ uniquely determines the probability generating
function of $\#$, which by Eq. (26) uniquely determines the sequence of moments
of the measure $\mu$. In turn, $\mu$ is a bounded measure on $[0, 1]$ and hence
uniquely determined by its moments. And the parameters
$(\lambda, p_1, p_2, \ldots)$ are uniquely determined by $\mu$. □

D Proof of Lemma 21



For $n = 1, 2, \ldots$, we assume $\#_n$ has the extended Poisson-binomial
distribution with parameters $(\lambda_n, p_{n,1}, p_{n,2}, \ldots)$. We
further assume $\mu_n$ is the spike size-location measure with parameters
$(\lambda_n, p_{n,1}, p_{n,2}, \ldots)$.

(2) $\Rightarrow$ (1). Suppose the $\mu_n$ converge weakly to some finite
measure $\mu$ on $[0, 1]$. We want to show that $\#_n$ converges in
distribution to a finite-valued limit random variable.

In Appendix C, we noted that we can express the probability generating
function of an extended Poisson-binomial distribution in terms of a spike
size-location measure with the same parameters. In particular, by Eq. (25), we
can write the negative log of the probability generating function of $\#_n$ as

\[
-\log \mathbb{E} s^{\#_n} = \int_{[0,1]} f_s(x) \, \mu_n(dx),
\]

where

\[
f_s(x) := \sum_{j=1}^{\infty} \frac{1}{j} (1-s)^j x^{j-1}
= \begin{cases}
-\dfrac{1}{x} \log\left[ 1 - (1-s) x \right], & x > 0, \\
1 - s, & x = 0.
\end{cases} \tag{27}
\]

Since $f_s(x)$ is bounded and continuous in $x$ on $[0, 1]$ for each fixed $s$
with $0 < s \le 1$, we have by the assumption of weak convergence of $\mu_n$
that

\[
\lim_{n \to \infty} -\log \mathbb{E} s^{\#_n} = \int_{[0,1]} f_s(x) \, \mu(dx).
\]

Moreover, since $\mu$ is finite by assumption, we have that the result is
finite for each $s$ with $0 < s \le 1$. It follows that $\#_n$ converges in
distribution to a finite random variable $\#$, with probability generating
function given by

\[
\mathbb{E} s^{\#} = \exp\left\{ -\int_{[0,1]} f_s(x) \, \mu(dx) \right\}. \tag{28}
\]

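Assume (1). Now suppose the $\#_n$ converge in distribution to a finite random
variable $\#$. The next two parts of the proof will rely on an intermediate
step: showing that $\mu_n$ has bounded total mass in this case.

To show that $\mu_n$ has bounded total mass, first note that
$\mathbb{E} \#_n$ is exactly the total mass of $\mu_n$:

\begin{align*}
\mathbb{E} \#_n &= \lambda_n + \sum_{k=1}^{\infty} p_{n,k} =: \Sigma_n, \\
\text{and} \quad \mathrm{Var}\, \#_n &= \lambda_n + \sum_{k=1}^{\infty} p_{n,k} (1 - p_{n,k}).
\end{align*}

Noting that $\mathrm{Var}\, \#_n \le \Sigma_n$ allows us to apply Chebyshev's
inequality to find

\begin{align*}
1/4 &\ge \mathbb{P}\left( |\#_n - \mathbb{E} \#_n| \ge 2 \sqrt{\mathrm{Var}\, \#_n} \right) \\
3/4 &\le \mathbb{P}\left( |\#_n - \Sigma_n| < 2 \sqrt{\mathrm{Var}\, \#_n} \right) \\
&\le \mathbb{P}\left( |\#_n - \Sigma_n| < 2 \sqrt{\Sigma_n} \right) \\
&\le \mathbb{P}\left( \#_n > \Sigma_n - 2 \sqrt{\Sigma_n} \right).
\end{align*}

Since $\#_n$ converges in distribution by assumption, the sequence is tight.
Choose $\epsilon$ such that $1/2 > \epsilon > 0$. Then there exists some
$N_\epsilon$ such that, for all $n \ge 1$, we have
$\mathbb{P}(\#_n \le N_\epsilon) \ge 1 - \epsilon > 1/2$. It follows that, for
all $n \ge 1$,

\[
1/4 < \mathbb{P}\left( N_\epsilon > \Sigma_n - 2 \sqrt{\Sigma_n} \right).
\]

Since $\Sigma_n$ is non-random, it must be that
$\mathbb{P}(N_\epsilon > \Sigma_n - 2 \sqrt{\Sigma_n}) = 1$. That is, the
total mass of $\mu_n$ is bounded.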
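
The Chebyshev step can be illustrated on a concrete case. The sketch below takes a hypothetical Poisson-binomial example with no Poisson component (all names and the probability lists are our own assumptions), computes its exact distribution by convolution, and checks that the probability of exceeding the total mass minus twice its square root is at least $3/4$.

```python
from math import sqrt


def poisson_binomial_pmf(ps):
    """Exact pmf of a sum of independent Bernoulli(p) variables on {0..len(ps)}."""
    pmf = [1.0]
    for p in ps:
        nxt = [0.0] * (len(pmf) + 1)
        for j, mass in enumerate(pmf):
            nxt[j] += mass * (1 - p)   # this Bernoulli contributes 0
            nxt[j + 1] += mass * p     # this Bernoulli contributes 1
        pmf = nxt
    return pmf


def lower_tail_check(ps):
    """Return (bound_holds, prob), where prob is the probability that the sum
    exceeds total_mass - 2 * sqrt(total_mass)."""
    total_mass = sum(ps)  # equals the mean when the Poisson parameter is 0
    threshold = total_mass - 2 * sqrt(total_mass)
    pmf = poisson_binomial_pmf(ps)
    prob = sum(mass for j, mass in enumerate(pmf) if j > threshold)
    return prob >= 0.75, prob


# Hypothetical examples (assumptions for illustration only).
ok_a, prob_a = lower_tail_check([0.5] * 40)
ok_b, prob_b = lower_tail_check([0.9] * 10 + [0.05] * 100)
print(ok_a, prob_a, ok_b, prob_b)
```

In both examples the probability is far above the guaranteed $3/4$, which reflects that Chebyshev's inequality is a worst-case bound; the proof only needs the bound to hold uniformly in $n$.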

Assume (1) and (2). Suppose $\#_n$ converges in distribution to some
finite-valued limit random variable $\#$ and that $\mu_n$ converges weakly to
some finite measure $\mu$. We want to show that $\#$ has an extended
Poisson-binomial distribution, that $\mu$ is a spike size-location measure, and
that $\#$ and $\mu$ have the same parameters.

We start by showing that $\mu$ is discrete. Choose any $\epsilon > 0$. Since
the mass of $\mu_n$ is bounded across $n$ by the previous part of the proof
("Assume (1)"), the number of atoms of $\mu_n$ with size greater than
$\epsilon$ is bounded across $n$. It follows that the number of atoms of $\mu$
has the same bound. So $\mu$ is discrete. Since $\mu_n$ converges weakly to
$\mu$, we see that $\mu$ must have atoms with sizes and locations
$p_1, p_2, \ldots$ such that

\[
1 \ge p_1 \ge p_2 \ge \cdots
\]

as well as a potential atom, with size we denote by $\lambda$, at zero. That
is, $\mu$ is a spike size-location measure with parameters
$(\lambda, p_1, p_2, \ldots)$.

In a previous part of the proof ("(2) $\Rightarrow$ (1)"), we expressed the
probability generating function of $\#$ as a function of $\mu$ (Eq. (28)). With
this relation in hand, we can reverse the series of equations presented in
Appendix C and ending in Eq. (25) to find the form of the probability
generating function for $\#$ (Eq. (24)). In particular, Eq. (24) tells us that
$\#$ is an extended Poisson-binomial random variable with parameters
$(\lambda, p_1, p_2, \ldots)$. We emphasize that $\#$ has the same parameters
as $\mu$, which we have already shown above is a spike size-location measure.

(1) $\Rightarrow$ (2). Now step back and assume that $\#_n$ converges in
distribution to a finite-valued limit random variable; call it $\#$. We wish to
show that $\mu_n$ converges weakly to some finite measure on $[0, 1]$.

By a previous part of this proof ("Assume (1)"), the mass of $\mu_n$ is bounded
across $n$. Moreover, by construction, all of the mass for each $\mu_n$ is
concentrated on $[0, 1]$. So it must be that the sequence $\mu_n$ is tight. It
follows that if every weakly convergent subsequence $\mu_{n_j}$ has the same
limit $\mu$, then $\mu_n$ converges weakly to $\mu$.

Consider a subsequence $(n_j)_j$ of $\mathbb{N}$ along which $\mu_{n_j}$
converges weakly. We know $\#_{n_j}$ converges in distribution to $\#$ by the
assumption that $\#_n$ converges in distribution to $\#$. The previous part of
this proof ("Assume (1) and (2)") gives that the form of the limit of
$\mu_{n_j}$ is determined by $\#$; namely, the limit is a spike size-location
measure with parameters shared by $\#$. In particular, then, the limit $\mu$
must be the same for every subsequence, and the desired result is shown. □